When I was Director of Cloud and Automation Services at Disney, the group I reported into was a Disney-specific Internet Service Provider for the other business units (and there were a LOT of them). The rough goal of that position was “making it possible to run more things for business units, cheaper, faster, better”. It was that group’s first stab at offering self-service, cloud-like capabilities to the rest of the organization, and it ultimately led me onward to OpenStack, Nebula, and other interesting technical challenges.
One of the early challenges (that is still a challenge today) was applying some of the operational best practices to running services when doing it via “self-service”. The classic ops model (which I suspect is still there today, just mixed in with other options) used the “big guys” – HP OpenView, BMC, Tivoli, etc. What we wanted to do was enable a portion of what I was internally calling our “Data Center API” for the business unit developers. The goal was to make it possible for a developer to set up monitoring for their application without having to open a service desk ticket, which then led to some phone or email conversation nailing down the specifics of what should be monitored, what thresholds applied, or even reaching into the more complex areas beyond simple metrics and thresholds. In short, simplify it all down, and get the human process out of the way.
The licensing models for all of those products were (and still are) complete shit for self-service interactions and growth. They were also all configuration driven (not API driven), and some of them only allowed those configurations to be updated through a graphical user interface. Most of my vitriol was reserved for the ones that required Win32 applications to update their configurations. We ended up doing some development around a cheaper option with a site-licensed model so that we didn’t have the incremental cost growth issues. In the end, the first cut was as simple as “we’ll ping test the VM when we set it up for you, and you can have us verify one or more arbitrary HTTP URLs are active and send alerts (emails, in that case) if they aren’t”. It was all imperatively driven by the request to create the VM, and editable after the fact through a web interface exposed to the developer. Not bad, really – although I wanted to make it a lot more.
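To make that first cut concrete, here is a minimal sketch of the HTTP half of it in Python. It is not the code we ran at Disney; the function names, the 2xx-means-healthy rule, and the `alert` callback (standing in for the email step) are all assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, ValueError):
        return 0  # DNS failure, refused connection, timeout, malformed URL


def failing_urls(urls: Iterable[str],
                 fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the URLs that did not answer with a 2xx status (assumed rule)."""
    return [url for url in urls if not 200 <= fetch(url) < 300]


def check_and_alert(urls: Iterable[str],
                    alert: Callable[[str], None],
                    fetch: Callable[[str], int] = http_status) -> List[str]:
    """Run one check pass and call alert(url) once per failing URL."""
    failed = failing_urls(urls, fetch)
    for url in failed:
        alert(url)
    return failed
```

The `fetch` parameter is injectable so the check can be exercised without a network; a scheduler (cron, or whatever the platform provides) would call `check_and_alert` on an interval.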
Fast forward to today, some five years later, and microservice deployments have exploded the scale, footprint, and diversity of monitoring – both metrics data and logs. PaaS add-on marketplaces like Heroku’s, along with AWS feature gaps, have given services like Splunk, DataDog, New Relic, and DynaTrace a chance to exist, grow, and evolve into powerful tool sets. The open source/DIY side also exploded, with the ELK stack, Graphite, Nagios, InfluxData, and Zenoss all mixing open source and commercially supported projects. Most recently, tooling like Prometheus, Grafana, and InfluxDB, as well as James Turnbull’s book The Art of Monitoring (including Riemann), have been grabbing a lot of attention, very deservedly so.
What caught my eye (and incited this post) is Prometheus’s advances, now released as 1.0 (although they’ve been pretty damn stable for a while). They followed the same keep-it-simple poll/scraping path that we did back at Disney, and have some blog posts related to scaling in that configuration. I have personally come to favor direct, event-driven reactions to reduce the cascade of added latencies and race conditions that you get with polling, but for the trailing-response sort of monitoring mechanisms, you can’t argue with their success and ability to scale it.
Most interestingly to me, they are implicitly encouraging a change in the scope of the problem. While they can scale to the “Missouri River in ’93” floods of data that monitoring an entire datacenter can deliver, they’ve defaulted to scoping the area they cover down to a single complex application/service made up of microservices. One per, thank you very much. Prometheus has solved the setup and configuration problem by embedding into the Kubernetes service discovery mechanisms. As the pods change, data collection keeps rolling in as new individual services are added and removed. I suspect comparable orchestration technologies like Docker Swarm, Terraform, and Marathon have similar mechanisms, and AWS CloudFormation has included simple monitoring/alerting for ages within their basic API.
It still means another API to write to and work with (alert templates and such), but the client libraries are the real win: they are available in any one of the major languages you might use in even the most diverse polyglot development environments. It’s the second most important component in the orchestration game – just getting the service up and running being the first.
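To show roughly what those client libraries take off your hands, here is a hand-rolled, standard-library-only sketch of a counter exposed in the plain-text format that Prometheus scrapes. In practice you would reach for the official client library for your language; the `Counter` class, the `myapp_requests_total` metric name, and the `/metrics` handler below are my own illustrative stand-ins, not the library’s API.

```python
import threading
from http.server import BaseHTTPRequestHandler


class Counter:
    """A monotonically increasing counter, safe to bump from several threads."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {value}\n")


# Hypothetical metric for illustration; the application calls REQUESTS.inc().
REQUESTS = Counter("myapp_requests_total", "Requests handled by the app.")


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter value at /metrics for a scraper to poll."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Serving it is one more line with `http.server.HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()` (the address and port are arbitrary choices), after which a scraper can poll `/metrics` on that port.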
This is taking a datacenter API concept and scaling it down so that it can run on a laptop for the developer. Prometheus, Kubernetes, Docker, etc. can all be run locally, so you can instrument your application while you’re writing it and get real-time feedback on what it’s doing, and how, while you develop. Rather than just deploying your code, you can deploy your code next to an ELK stack configured for your services, and now monitoring with Prometheus as well.
This is the sort of functionality that I expect a Platform as a Service (Heroku, CloudFoundry, OpenShift, etc.) to provide. Clear APIs and functionality to capture metrics and logs from my running application, to the point that you could create integration testing bound into these services. In my ideal world, these tools can capture and benchmark resource utilization during functional integration and regression tests, to annotate the pass/fail markers and actually provide, you know, metrics against which you can optimize and base real development decisions (which are also business decisions if you’re doing it right).
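As a rough illustration of annotating a pass/fail marker with resource numbers, here is a small Python sketch. It is a toy under stated assumptions: the `measured` helper and the shape of the `results` dict are made up for this post, and `tracemalloc` only sees memory allocated by Python itself, not the whole process or any other service under test.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measured(results: dict, label: str):
    """Record outcome, wall-clock time, and peak Python memory for a block."""
    tracemalloc.start()
    started = time.perf_counter()
    outcome = "fail"  # stays "fail" if the wrapped block raises
    try:
        yield
        outcome = "pass"
    finally:
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {
            "outcome": outcome,
            "seconds": elapsed,
            "peak_bytes": peak_bytes,
        }


# Usage: wrap a test body, then inspect or export the annotated result.
results = {}
with measured(results, "test_build_index"):
    index = {n: str(n) for n in range(10_000)}
    assert index[42] == "42"
```

A real platform would ship these numbers to its metrics store alongside the test report; the point is only that the pass/fail marker and the measurements travel together.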
The closer we can move the detail needed to run an application “in production” to the developer, the bigger the win – all the details of testing, validation, and operational performance need to be at hand, as well as a clear means of using them.