
China, October 2016

I’ve been quiet on this blog for a few weeks, mostly due to travel. I recently completed a wonderful trip to China, a tour we’d been planning for the better part of several years. I’m sure I’ll get back to technical topics, but for now I wanted to put up some of the photos from our amazing trip.

Between Karen, my mother, and myself, we took nearly 1500 photos. Picking the iconic or even relevant photos from that set of memories… well, just wow. That’s tough. Fair warning, this will be almost all photos…

[Photo: Obligatory selfie with my love, at the Temple of Heaven in Beijing]

I’ve had a side interest in the art of stone carving for a long time, and I got to enjoy some exquisite examples while I was at the Temple of Heaven.



That extends into architectural details and ornamentation.


The details that we saw are innumerable. Just as representative of the trip, perhaps more so, was the sense of vastness and scale. Texans may think they “build big”, but really – they’re pretty weak punks when it comes to China.

[Photo: Panorama from the Great Wall of China]

 

[Photo: Panorama from the Forbidden City]

[Photo: Panorama from Xi’an and the archeological dig where they are unearthing the terra cotta warriors]

But of all the places, I think the most visually stunning wasn’t built. [Photo: Panorama from Guangxi, China, on the Lichiang river – what’s often called “the gumdrop mountains” in English]

One blog post would never do it justice: 12 days in a whirlwind tour, touching down in quite a few different parts of a hugely diverse country.

We had some goofy fun times, such as riding the maglev in Shanghai, and my oh-so-authentic terror of being on a train moving at 431 km/hr (it was really cool!).


Traveling with our fellow tour mates and getting to know other people with an interest in China as well.

[Photo: Karen and me with David, the youngest member of our tour, at the dig in Xi’an]

[Photo: Being with family to have this experience (Karen, my mom, and me)]


Meeting new people who were willing and interested in chatting with curious tourists.

[Photo: Including quite a number of school children who were all very excited to practice a little English and some pleasantries]

Quite an amazing trip, walking through that 大門 to see China.


Wallowing in research papers

I started my first job at the University from which I graduated. It was a strange time: a mild economic downturn, and “peace” had just broken out. The Cold War was effectively over, the Berlin Wall had come down, and it wouldn’t be long until the Soviet Union dissolved into its component states. I had excelled in my studies of CMOS chip design, but never used that skill in any job. Instead, what was available was software and system administration. An evening gig at a campus “help desk” is where I started. In that environment, I had access to clusters of NeXT computers, SGI toasters running IRIX, some standalone IBM RS/6000s running AIX, a VM/CMS mainframe, and rooms and rooms of PCs and Macs. You couldn’t ask for a more fertile ground for learning.

One of the CS professors doubled as an early job mentor (as well as being “the security guy” for the University): Greg Johnson. When I was first stepping into Unix and trying to get a handle on it, he provided some sage advice that stuck with me:

Reading man pages is a skill. It’s something you can practice at and learn, and there is a treasure trove of information there.

For those who don’t know, man (short for manual) pages were (and are) the built-in documentation for Unix – accessible using the command man – and they cover everything from how to use commands on the command line to details about how various software libraries function. You can easily find them: any Linux OS probably has them installed, as do Macs. Just open a terminal or SSH session and use a command like “man ls” and you’ll get a gist of what’s there – as well as why it’s a skill to read them.

I took Greg’s advice to heart and focused on learning to read man pages. It may have been the moral equivalent of learning to read by going through a dictionary, but it worked. I wallowed in man pages, following leads and terms referenced in that green-screen text output. It started as a confused mess of words that didn’t mean anything. After seeing them repeatedly, and wrapping some context around them, I was able to sort out what various terms meant.

Twenty-five years later, I find myself learning by the same process. A different set of confusing words, and a different topic to learn, but things are coming into focus. Over the past six months (well, since DeepMind won the Go matches against Lee Sedol) I’ve been wanting to understand machine learning and the current state of research. The topic is so new, so rapidly unfolding, that other forms of documentation are pretty limited. There are articles and how-tos out there for Keras, TensorFlow, and others – but understanding the topic is different from learning how to use the tools.

Wallowing in academic research papers is as messy as the man pages were in the “Unix wars” era. The source these days is arXiv, which hosts scientific papers in prepublication form as a common repository for academic research and citation. They have a handy web interface that lets you list the papers from the past week:

https://arxiv.org/list/cs.AI/pastweek

Every weekend, I refresh that page in a browser, browse through the titles, and grab and stash PDFs into Evernote for later reading. Every now and then, I find a “survey of…” paper, which is a gold mine of references. From there, either arXiv or Google Scholar can help speed up finding those additional papers or areas of interest. Wikipedia has been a godsend in turning some of the jargon-laden terms into plain English. An AI issue such as the vanishing gradient problem gets a lot more tractable on Wikipedia, where volunteers explain the issue much more simply than the strict correctness of academic papers allows.
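That weekly ritual is easy to script, for what it’s worth. Here’s a minimal sketch using arXiv’s public Atom API rather than scraping the listing page – the category and result count are just illustrative choices:

```python
# A minimal sketch of the weekly arXiv sweep, using the public Atom API at
# http://export.arxiv.org/api/query. Category and result count are arbitrary.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def recent_papers(category="cs.AI", max_results=50):
    """Return (title, pdf_link) pairs for the most recent submissions."""
    url = (
        "http://export.arxiv.org/api/query"
        f"?search_query=cat:{category}"
        f"&sortBy=submittedDate&sortOrder=descending&max_results={max_results}"
    )
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall(f"{ATOM}entry"):
        # Titles come back with embedded newlines; normalize the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", "").split())
        pdf = next(
            (link.get("href") for link in entry.findall(f"{ATOM}link")
             if link.get("title") == "pdf"),
            None,
        )
        papers.append((title, pdf))
    return papers

if __name__ == "__main__":
    for title, pdf in recent_papers():
        print(f"{title}\n  {pdf}")
```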

I hit one of those “magic” papers just last weekend. Entitled “A growing long-term episodic & semantic memory”, it was one of the papers that stitched together a lot of concepts that had been loosely floating around in my head.

I don’t understand it sufficiently to explain it simply, but I’m getting there…

Embodied AI

I’ve thought for quite a while that some of the critical paths forward in building systems that learn require more complexity and more variables in their world. A couple of weekends ago a friend passed me an article on Engadget about an AI contest set in Doom. Given the recent exemplary advances of systems learning to play Nintendo games, it’s a natural successor – that, and I think there’s a lot of social overlap between people interested in building AI systems and those who enjoy playing video games. Puzzle solvers all, on various levels.

Part of me is thrilled that we’re taking initial stabs and creating challenges to extend the state of the art. Another part of me is thinking “Oh great, as if this isn’t going to spread the Terminator existential fear.” I don’t really have any fear of the Terminator scenario. Like some others, I think that’s at best a very distant concern in the advancement of AI, and one far more driven by PR and media than by the real issues.

I do have a deep and abiding concern about “bad actors” with the augmentation of AI at their fingertips. For now and the near future, AI is being applied as an augmentation of people (or companies) – and their abuses of those resources could easily prove (and I think are far more likely) to be more damaging.

Although we’re farther from creating a general artificial intelligence than many assume (and yet, from a long-term perspective, it will still happen sooner than we expect), I suspect the way it will learn will be nearly the same way we do – partially by being taught, and partially from experience. Today’s AI systems are the moral equivalent of mockingbirds – good at repeating something that already exists, but with limited capability for applying innovation, or for dealing with “surprise” events and learning from them.

I think a far more interesting effort in this space – not quite as click-bait worthy as AIs playing DOOM – is what Microsoft is calling Project Malmo: some mechanisms to start to provide a bit of embodiment in a world simplified from our own, Minecraft. I haven’t seen much press about the results, contests, etc. from opening up Project Malmo, but the PR and blog articles from earlier this summer point to where anyone can get started with it, from their open code on GitHub. (I’m still waiting for the Ballmer-esque world to come and take Microsoft back into the bad old days and away from their renewed sanity around using and contributing to open source, but it seems to be sticking.) I’d love to see some challenges built on Malmo, and with goals different from getting the highest kill count.

Can we versus should we

I made a conscious choice to use the phrase “should we…” when asking a question about technical implications rather than “can we…”. The difference is subtle and very significant.

There is a lot of self-pride in the engineering community around being able to solve the problem. At the risk of too hasty a generalization, culturally it’s a bundle of people who love to solve puzzles. As we’ve been solving problems, some of the side effects or impacts of those solutions are really about the organization we’re affecting, or the people consuming our tooling.

I mostly work on open source related tooling and projects these days: back-end or server-side code, distributed systems, microservices if you want to play buzzword bingo. In the end though, it’s all about making tools – and tools are for people. Michael Johnson, who makes tools for the folks at Pixar, once offered me a bit of advice when I was at Disney making tools for their back-end developers. I’ll summarize what I remember, because the specific words are lost to my memory:

Pay attention to what the tool is doing as much as the technical work of doing it. Tools will shift the balance of an organization and how it works significantly.

That bit of advice has really stuck with me, and is something I think about when setting out on a project or even discussing adding some capability. What is this enabling? Who is it helping? How will they use it? What impact will it have?

It’s very easy not to look beyond the immediate impact of what a tool will enable, but doing so is critical to making something really effective. And that’s why I try to ask the question “should we” – to set up the context for a discussion of not only what a feature in a tool can do, but what it means in a broader sense. In the end, it’s really just another puzzle – so getting a diverse set of brains working on possible solutions usually ends up with some good solutions, and surprising insights. It’s really a matter of getting the right questions framed, and then getting out of the way to see what ideas come up.

Bugs: It’s a data problem

A number of years ago, a friend complained to me “Why does everyone seem to want to reinvent the bug tracker?!” That semi-inebriated conversation stayed with me as a question that has since rolled around in the back of my head.

A couple of years ago at Nebula, I was actively triaging bugs for a weekly development meeting. The number of issues logged was up around 250. It wasn’t just bugs – big and small feature ideas were included, as were the “hey, we took a shortcut here to get this out the door for a customer, let’s get back here and clean this up” items. It was the whole mess of “tracking development” stuff, freely intermixed with what was far more appropriate for product management to manage. I was guiding the engineering effort, so I wallowed in this data all the time – I knew it inside and out. The reason that day stood out is that I had just hit my threshold for being overwhelmed by the information.

It was at that point that I realized that tracking bugs, development ideas, and technical debt was fundamentally a data problem, and I needed to treat it as such to have any semblance of control in using it. Summarization and aggregation were critical, or we’d never be able to make any reasonable decisions about what should be focused on. And by reasonable I mean somewhat objective, because I realized that up until then everyone had been counting on me to know all this information and provide the summary – and I was doing that intuitively and subjectively.

I call them “bugs” because that’s easy and I’m kind of lazy, and it’s kind of a cultural norm in software engineering. The software that has been most successful at wrangling this kind of data flow has generally been called a “bug tracker”: Bugzilla from back in the day, Trac, FogBugz, JIRA, or more often these days GitHub issues. “Bugs” is not technically accurate – it’s all the ephemera from development, and it starts to get intermixed with “hey, I just want to track and make sure we do this” stuff that may not even be about the product. Once you put in place a tracking system that people actually use, it gets used for all sorts of crap.

It gets more confusing because some of the data tracked really is bug reports. Bug reports have narrative included – or at least I hope narrative exists. And free-form narrative makes it a complete pain in the butt to aggregate, summarize, and simplify.

I think that is why many folks start adding tags or summarization metadata to these lists. The most common are the priority tags – “really important” vs. “okay to do later” markers. Then there is the constant tension between fixing things and adding new things, which tends to encourage some teams to use two different but confusingly similar markers like “severity” vs. “priority”. In the end, neither does a particularly good job of explaining the real complexities of the collection of issues. At this point, I would personally rather just make the whole damn thing a stack rank, factoring in all the relevant pieces, rather than trying to view it along two, three, or more axes of someone else’s bucketing. In practice, it usually ends up getting bucketed – and who (individual or group) does the bucketing can be very apparent when you look through the data.
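To make the stack-rank idea concrete, here’s a hypothetical sketch that collapses the usual axes into one ordered list. The field names and weights are invented for illustration; the point is that a single ranking, however rough, is easier to reason about than severity crossed with priority crossed with age:

```python
# Hypothetical stack-rank scoring: every field and weight here is made up.
from dataclasses import dataclass
from datetime import date

@dataclass
class Issue:
    title: str
    severity: int          # 1 (data loss) .. 4 (cosmetic)
    priority: int          # 1 (do it now) .. 4 (someday)
    opened: date
    customer_reports: int = 0

def rank_score(issue: Issue) -> float:
    """Lower scores sort first: blend the competing axes into one number."""
    age_days = (date.today() - issue.opened).days
    return (
        2.0 * issue.severity
        + 1.5 * issue.priority
        - 0.01 * age_days               # old issues slowly float up the list
        - 0.5 * issue.customer_reports  # every duplicate report is a vote
    )

def stack_rank(issues: list[Issue]) -> list[Issue]:
    return sorted(issues, key=rank_score)
```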

This is all fine while you’ve got “small amounts” of data – small amounts in my case being up to about 250 reports. The exact number will be subjective to whoever is trying to use it. So far, I have been writing about being overwhelmed by the volume of this information. Growth over time caused most of that volume – there is always more work than you will ever be able to do. If you add to that flow by collecting all the crashes from remote programs – on desktops, in browsers, or on mobile devices – the volume accelerates tremendously. Even with a smaller “human” flow, there is a point where just getting the narrative and data summarized falls apart, and you need something other than a human being reading and collating to summarize it all.

A few years ago Mozilla started some really amazing work on collecting crash data in aggregate and getting it summarized. They were tracking inputs on a scale that nobody can handle individually. They built software specifically to collect tracebacks and error messages, which they automatically aggregate and summarize. They also set up some thoughtful process around how to handle the information. Variations on that theme have been productized into software solutions such as Crashlytics (now part of Fabric), Rollbar, and Sentry.
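The core of that aggregation idea fits in a few lines. Here’s a toy sketch – real systems are far more sophisticated about grouping, but it shows why this is a data problem: bucket incoming tracebacks by a signature and count them, so thousands of crash reports collapse into a short ranked list.

```python
# Toy crash aggregation: group tracebacks by a signature (exception type plus
# the innermost frames) and count occurrences. Purely illustrative.
import hashlib
import traceback
from collections import Counter

def signature(exc: BaseException, frames: int = 3) -> str:
    """Hash the exception type and the last few stack frames."""
    tb = traceback.extract_tb(exc.__traceback__)[-frames:]
    key = type(exc).__name__ + "".join(f"{f.filename}:{f.name}" for f in tb)
    return hashlib.sha1(key.encode()).hexdigest()[:12]

crash_counts: Counter[str] = Counter()
examples: dict[str, str] = {}

def record_crash(exc: BaseException) -> None:
    sig = signature(exc)
    crash_counts[sig] += 1
    examples.setdefault(sig, repr(exc))   # keep one exemplar per bucket

def summary(top: int = 10) -> list[tuple[str, int, str]]:
    """The short ranked list a human can actually read."""
    return [(sig, n, examples[sig]) for sig, n in crash_counts.most_common(top)]
```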

Like its cousin, the performance metrics commonly collected on hosted services (with something like Grafana or Graphite), this is informational gold. This is the detail that you want to be able to react to.

It is easy to confuse what you want to react to with how you want to react. It’s common when you’re growing a team: you start mapping process around the data. When you’re dealing with small numbers of bugs, it is common to think of each report as an exemplar – where duplicates will be closed if reported – and you start to track not just what the issue is, but what you’re doing to fix it. (The truth is, as the volume of reports grows, you almost never get any good process for closing duplicates.)

At first it’s easy – an issue is either “open” or “fixed”. Then maybe you want to distinguish between “fixed in your branch” and the fix being included in your release. Pretty soon, you will find yourself wandering in the dead marshes trying to distinguish between all the steps in your development process – for nearly no value at all. You are no longer tracking information about the issue; you’re tracking how you’re reacting to the issue. It’s a sort of development will-o’-the-wisp, and not the nicer form from Brave, either.

The depths of this mire can be seen while watching well-meaning folks tweak Jira workflows around bugs. Jira allows massive customization and frankly encourages you down this merry path. I’m speaking from experience – I have managed Jira and its siblings in this space, made more than a few swamps, and drained a few as well.

Data about the bugs won’t change very much, but your development process will. When you put that process straight into the customization of a tool, you have just added a ton of overhead to keep it in sync with what is a dynamic, very human process. Unless you want to be in the business of writing the tools, I recommend you use the tools without customization and choose what you use based on what it will give you “out of the box”.

On top of that, keep the intentions separate: track the development process and what’s being done separately from feature requests and issues for a project. Maybe it’s obvious in hindsight, but as a data problem, it gets a lot more tractable once you’re clear on what you’re after. Track and summarize the data about your project, and track and summarize what you’re doing about it (your dev process) separately.

It turns out that this kind of thing maps beautifully to support and customer relationship tracking tools as well. And yeah, I’ve installed, configured, and used a few of those – Helpdesk, Remedy, Zendesk, etc. The funny thing is, it’s really the same kind of problem – collecting data and summarizing it so you can improve what you’re doing based on feedback – the informational gold that tells you where to improve.

Context and agency within conversational interfaces

I thought Tim O’Reilly’s recent article on Alexa made some really excellent points. The one that stood out most to me was that Alexa is always on and helpful, whereas other solutions such as Siri or Cortana only listen when you ask them. Every time you start a new conversation, it is treated as totally independent of all other conversations.

With every conversation being independent, the context of a conversation is lost – or so minimal as to be perceived as lost – within the conversational interface.

For example, imagine this sequence with a conversational agent (Siri in this case):

  • (speaker) “tell karen i’m on the way home”
  • (agent) [shows the message to be sent]
  • (speaker) clicks “send”

And then a few seconds later

  • (speaker) “where is she?”
  • (agent) “Interesting question, Joseph”

That last response is Siri’s code phrase for “WTF are you talking about?”. To most speakers, it is very clear that I’m asking about Karen. My expectation for a conversational agent is that this context would be retained; I would expect Siri to understand the relevance of Karen in that sentence.

Siri does a reasonable job of disambiguating some of these unknowns into relevant specifics. When it is not sure what you mean, it asks. For example, the question “where is Karen” on my phone brings up “Which Karen do you mean?” and shows me a list of potential choices from my address book. Once I clearly identify which one, I would hope that it would retain that context. In conversation, we maintain a set of expectations that help inform and apply relevant context. And this is where most conversational agents currently break down – they don’t maintain, even for a short duration, any conversational context. No “knowledge” is built up or maintained; we’re just talking to the same clean slate every time.
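The mechanics I’m wishing for don’t need to be exotic. Here’s a hypothetical sketch of a short-lived context store: once “Karen” has been disambiguated, a pronoun like “she” resolves to that contact for a little while. The names, the TTL, and the crude pronoun matching are all invented for illustration:

```python
# Hypothetical short-lived conversational context; everything here is invented.
import re
import time

class ConversationContext:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._referents: dict[str, tuple[str, float]] = {}

    def remember(self, role: str, entity: str) -> None:
        """e.g. remember('she', 'Karen') after the user disambiguates."""
        self._referents[role] = (entity, time.monotonic())

    def resolve(self, utterance: str) -> str:
        """Replace known pronouns with the remembered entity, if still fresh."""
        now = time.monotonic()
        resolved = utterance
        for role, (entity, stamp) in list(self._referents.items()):
            if now - stamp > self.ttl:
                del self._referents[role]          # context has expired
                continue
            resolved = re.sub(rf"\b{role}\b", entity, resolved, flags=re.I)
        return resolved

# ctx = ConversationContext()
# ctx.remember("she", "Karen")
# ctx.resolve("where is she?")   ->  "where is Karen?"
```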

When a conversational agent asks for clarification, it also makes us expect something that many conversational agents do not have, or have only in a very limited sense: agency. We expect that the conversational agent will act independently and take its own actions in the world. Siri does not, however, function like that. Instead it tries to interact as though it’s acting for you – presenting augmented actions of your own rather than acting as an independent entity.

Here’s an example that illustrates that point:

  • (speaker) “tell karen I’m thinking of her”
  • (agent) [shows the message “I’m thinking of her”]

What I would expect is a grammar translation – a message to be sent that reads “I’m thinking of you”.
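That perspective shift is a small, mechanical thing. Here’s a toy sketch of the idea, with an invented pronoun mapping that ignores all the real grammatical complexity:

```python
# Toy perspective shift: pronouns referring to the recipient flip to second
# person before the message is sent. The mapping is invented and naive.
import re

RECIPIENT_PRONOUNS = {"her": "you", "him": "you", "them": "you"}

def reframe_for_recipient(message: str) -> str:
    pattern = r"\b(" + "|".join(RECIPIENT_PRONOUNS) + r")\b"

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = RECIPIENT_PRONOUNS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(pattern, swap, message, flags=re.IGNORECASE)

print(reframe_for_recipient("I'm thinking of her"))  # -> I'm thinking of you
```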

One of the user experience benefits that Tim O’Reilly’s article asserted as a positive is that Alexa doesn’t have a screen to show options or choices on. This forced their software to use conversation exclusively to share information, which is a different choice (I’m not sure it’s better or worse) than Siri’s or Cortana’s. I personally prefer the multi-modal aspect of the interaction, simply because it’s often easier for me to visually scan a list of choices than it is to listen to options, where I end up feeling like I’m in phone response hell. “Please listen carefully to the following options, as they may have changed…” is 7 seconds of my life lost every freakin’ time.

As conversational interfaces grow, I think this is one area where Cortana and/or Siri may have a distinct advantage – the capability of reacting and interacting not just in a conversational manner, but visually as well. Microsoft’s earliest experiment in this space – the well-intentioned but annoying-as-hell and now greatly maligned “Clippy” – was all about attempting to understand context and intent by watching actions and trying to be predictive. In my opinion, “The Big Mistake” was the initial concept that such a system should ever interrupt the actions being taken without invitation.

That said, the capability of using multi-modal inputs is incredibly powerful, and I don’t think we’ve even seen the first truly effective use of it. The desktop era of the late ’80s added mouse movement to the keyboard, and the past decade added very effective touch interactions. Siri and Cortana are now adding voice as an input, expanding the ways we can interact with our devices. Games have been starting to do this over the past few years, as game consoles have added games that take voice commands in addition to controller inputs. Augmented reality systems could potentially add in the camera view they have, and – with object recognition technology that’s starting to be effective – add yet another layer of information for an agent to work with.

The human equivalent is easy to see – you can get so much more from a voice conversation than a written one because you have the additional information of tone and timing to help intuit context. A video conference provides even more, using body language and facial expressions as additional channels of information. These additional “modes” provide a broader range of expression that is only starting to be explored and utilized in areas like affective computing.

Proprioception

Proprioception is one of those big, complex, compound Latin-y words that means “knowing where you’re at”. Hit the Wikipedia link for a much deeper definition. The reason I’m writing about it is that it’s something I think anyone writing software for a distributed system should know about.

This kind of capability may not be a requirement that you’re tracking or even thinking about, but you should be. Load balancers (HAProxy, or the various hardware ones like F5) have been using this concept for ages. It is now getting traction in the various services that orchestrate containers. Kelsey Hightower gave a talk about the concept at Monitorama this year. It is more than worth the 30 minutes of the video – I would actually suggest you stop reading this and go watch it if you’re involved in writing services or building software. It is applicable well outside the space of back-end distributed systems, although his examples are all about building robust server-side software.

Then this morning in DevOps Weekly there was an article from New Relic on how to use the same kind of feature that Docker has recently added: HEALTHCHECK. It’s Docker specific, with excellent examples of why adding this to your service matters and how you can use it. It’s not just containers, either – Consul has a similar mechanism built in for use with service discovery when using VMs or bare machines as well.

It’s a concept that I think a lot of folks are using, but without much consistency, and it seems to be a good indicator of “do they get it” when it comes to devops and robustly running large infrastructure. We should go beyond that: from my perspective it should be an expected deliverable, tested and validated, for every component of a distributed system. It is an out-of-band channel for any application or service to say “I’m overloaded, or broken, or doing fine” – if you’re writing one, add this in.
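As a rough idea of what “add this in” can look like, here’s a minimal sketch of a /healthz endpoint that reports ok, overloaded, or broken. The thresholds and checks are invented; the point is that a Docker HEALTHCHECK, a Consul check, or a load balancer probe can hit a URL like this and react to the status code:

```python
# Minimal health endpoint sketch; the path, port, and thresholds are invented.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def assess_health() -> tuple[int, dict]:
    load1, _, _ = os.getloadavg()        # Unix-only; stand-in for real checks
    if load1 > 8.0:                      # illustrative "overloaded" threshold
        return 503, {"status": "overloaded", "load": load1}
    # ... check database connectivity, queue depth, etc. here ...
    return 200, {"status": "ok", "load": load1}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = assess_health()
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```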

Eighty percent polishing

Like anyone, I reflect current situations and problems back to things I’ve done, other problems I’ve solved, or more generally what I’ve learned. After I graduated from college, one of the things I spent a lot of time learning about was metalworking – specifically the art of raising, a technique for taking sheet metal (iron, copper, steel, silver, etc.) and forming it into three-dimensional objects by working it over various forms, compressing and thinning the sheet with hammer blows. You also have to periodically soften the work by annealing it (so it doesn’t crack and sunder). I still keep the copper flask that I formed in my earliest classes – it’s sitting on a shelf of artwork and plants, a reminder to me of persistence and functional beauty.

In retrospect, one of the most valuable things I learned from that class was that when you’re creating something, the time to form up the rough object is pretty quick. It is the “finish work” that takes forever. The goal was often a smooth, shiny surface (although I enjoyed the dimpled hammer-beaten look myself) – and the reality is that after you’ve spent the time with the stakes and hammers, if you want that shiny surface, you’re going to have to work your butt off to get it. Cleaning, sanding, and polishing – another incremental process where it gets progressively better. That “finish work”, as I like to call it, takes a good 80% of the overall time in making the piece.

Advance 20 years – I still have a propane forge and shop anvil, a few stakes, and a lovely collection of hammers. And most of my creative time is now going into software projects.

I see the same pattern play out: getting the project up and working, just operational – which is an awesome feeling – is just the first 20%. If you want the work to be a finished piece, and not just something you say “heh, yeah – I could see how that might work” about, then you’re in for the remainder. Cleaning, sanding, and polishing in the software world means writing tests and fixing the operations so that they’re consistent and clear. Clear the technical debt so that future advances won’t be stunted from the start. Write documentation so that others can use what you made. Write notes to yourself or other developers so that you’ll have a clue what it was meant to do when you come back to it in a year. It might also mean making and refining a user interface (CLI, GUI, or whatever) that works flawlessly, shares relevant information, and is accurate and beautiful in its own right.

The deeper takeaway worth noting is that the 80% spent polishing is about making the object work for someone else. In the software world, that’s often driven by user feedback. You had better be ready to listen – because (as with art and craftsmanship) it may or may not match your viewpoint and world view. It is inherently subjective, which can make it a real struggle for more analytically minded people.

My recommendation is to take the time, get the feedback, and refine your work. In the end, I think we all want things that are both beautiful and functional, and it takes work and feedback to get there. And that is, to my mind, the essence of the craftsmanship of the work.

Distributed service platforms’ largest challenge: network interoperability

The biggest challenge in introducing technologies to an organization today is making them available to the end users. Underneath the layers of software, and even hardware, it’s the network – and how technologies plug into that network is a constant struggle. I doubt that’s a surprise to anyone deeply involved in this space and deploying things like OpenStack, Docker swarms, or Hadoop/Storm clusters.

The dark reality is that network interoperability is really crappy. Seriously, stunningly bad. Even with standards, the reality of vendor interoperability is troublesome, and competing standards make it even worse. Just within the space of container orchestration and exposing it across many physical servers, what you choose for networking is hugely impactful. Calico, Weave, Flannel, or bridging, possibly with Open vSwitch, are all pushing for attention and adoption, with the goal of becoming a de facto standard. Individually these are all good solutions – each has its bonuses and drawbacks – but the boundary to something else, the interface, is ultimately the problem.

Implementing one of these against an existing environment is the trouble point. It’s not just container platforms that drive this – many modern distributed system solutions have the same issue: OpenStack, Hadoop, and Storm, in addition to the Docker, Marathon, and Kubernetes platform sets. Even pre-engineered solutions at the multiple-machine or rack scale struggle with this. There is no tried and true method, and due to the ongoing complexities with just basic vendor network interoperability, it’s likely going to be a pain point for a number of years to come.

The most common patterns I’ve seen to date try to make a router or layer-3 switch the demarcation point for a service: define a boundary and (depending on the scale) a failure zone to make faults explicit, leveraging scale to reduce the impact. The critical difficulty comes with providing redundant and reliable access in the face of failures.

Configurations may include VLANs routed internally and static IP address schemes – maybe based on convention, or maybe just assigned. Alternatively (and quite popularly), internal services are set up with the expectation of their own private, controlled network, and the problem scope is reduced to boundary points where egress and ingress are handled, to at least restrict the “problem space”.

When this is done with appliances, it typically resolves to reference configurations and pre-set tweaks for the most common network vendors. Redundancy choices move with the vendor solutions to handle the specifics needed to validate full interoperability – from MLAG setups with redundant switches, to spanning tree (this one always gets hated on, with good reason), to more recent work like Calico, which interacts directly with routing protocols like BGP.

Watching this pain point for the last 5–6 years, I don’t see any clear and effectively interoperable solutions on the horizon. It is going to take a lot of heavy lifting to vet specific scenarios against specific vendor implementations. I’m hopeful that the effort going into container runtime networking will help drive better interoperability. If nothing else, the push of private clouds, and now containers, is driving a lot of attention to this problem space. Maybe the answer will lie in the evolving overlay networks, in BGP-based interop (like Calico), or in the equivalents in the IPv6 space, which only seem to be taking hold at the largest enterprises and government setups.

 

 

Data Center API – self-service monitoring

When I was Director of Cloud and Automation Services at Disney, the group I reported into was a Disney-specific Internet Service Provider for the other business units (and there were a LOT of them). The rough goal of that position was “making it possible to run more things for the business units – cheaper, faster, better”. It was that group’s first stab at doing self-service, cloud-like capabilities for the rest of the organization, and it ultimately led me onward to OpenStack, Nebula, and other interesting technical challenges.

One of the early challenges (that is still a challenge today) was applying operational best practices to running services when doing it via “self-service”. The classic ops model (which I suspect is still there today, just mixed in with other options) used the “big guys” – HP OpenView, BMC, Tivoli, etc. What we wanted to do was enable a portion of what I was internally calling our “Data Center API” for the business unit developers. The goal was to make it possible for a developer to set up monitoring for their application without having to open a service desk ticket, which then led to some phone or email conversation nailing down the specifics of what should be monitored, what thresholds applied, or even reaching into the more complex areas beyond simple metrics and thresholds. In short: simplify it all down, and get the human process out of the way.

The licensing models for all of those products were (and still are) complete shit for self-service interactions and growth. They were also all configuration driven (not API driven), and some of them only allowed those configurations to be updated through a graphical user interface. Most of my vitriol was reserved for the ones that required Win32 applications to update their configurations. We ended up doing some development around a cheaper option with a site-licensed model so that we didn’t have the incremental cost growth issues. In the end, the first cut was as simple as “we’ll ping test the VM when we set it up for you, and you can have us verify that one or more arbitrary HTTP URLs are active and send alerts (emails, in that case) if they aren’t”. It was all driven imperatively by the request to create the VM, and editable after the fact through a web interface exposed to the developer. Not bad, really – although I wanted to make it a lot more.
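The guts of that kind of first cut are almost trivially small, which was part of the appeal. Here’s a minimal sketch of such a check – verify that a handful of HTTP URLs respond, and email an alert if any don’t. The URLs, addresses, and SMTP host are placeholders, not anything from the original system:

```python
# Minimal URL-check-and-email sketch; all hosts and addresses are placeholders.
import smtplib
import urllib.request
from email.message import EmailMessage

URLS = ["https://example.internal/app/health"]   # placeholder
ALERT_TO = "oncall@example.internal"             # placeholder
SMTP_HOST = "mail.example.internal"              # placeholder

def check(url: str, timeout: float = 10.0) -> str | None:
    """Return an error description, or None if the URL looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status >= 400:
                return f"{url} returned HTTP {resp.status}"
    except Exception as exc:
        return f"{url} failed: {exc}"
    return None

def alert(problems: list[str]) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"URL check: {len(problems)} problem(s)"
    msg["From"] = "monitor@example.internal"
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(problems))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    problems = [p for p in (check(u) for u in URLS) if p]
    if problems:
        alert(problems)
```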

Fast forward to today, some five years later, with microservice deployments exploding the scale and diversity of what needs monitoring – both metrics data and logs. PaaS offerings like Heroku’s add-on marketplace, and gaps in AWS features, have given services like Splunk, Datadog, New Relic, and Dynatrace a chance to exist, grow, and evolve into powerful tool sets. The open source/DIY side also exploded, with the ELK stack, Graphite, Nagios, InfluxData, and Zenoss all mixing open source and commercially supported projects. Most recently, tooling like Prometheus, Grafana, and InfluxDB, and James Turnbull’s book The Art of Monitoring (including Riemann), have been grabbing a lot of attention – very deservedly so.

What caught my eye (and incited this post) is Prometheus’s advances, now released as 1.0 (although it has been pretty damn stable for a while). They followed the same keep-it-simple poll/scrape path that we did back at Disney, and have some blog posts related to scaling in that configuration. I have personally come to favor direct event-driven reactions, to reduce the cascade of added latencies and race conditions that you get with polling, but for the trailing-response sort of monitoring mechanism, you can’t argue with their success and ability to scale it.

Most interestingly to me, they are implicitly encouraging a change in the scope of the problem. While they can scale to the “Missouri River in ’93” floods of data that monitoring an entire datacenter can deliver, they’ve defaulted to scoping the area they cover down to a complex application or service made up of microservices. One per, thank you very much. Prometheus solves the setup and configuration problem by embedding into Kubernetes’ service discovery mechanisms: as Kubernetes changes its pods, data collection keeps rolling in as individual services are added and removed. I suspect comparable orchestration technologies like Docker Swarm, Terraform, and Marathon have similar mechanisms, and AWS CloudFormation has included simple monitoring/alerting within its basic API for ages.

It still means another API to write to and work with – alert templates and such – but the client libraries that are available are the real win, available in any of the major languages you might use in even the most diverse polyglot development environments. It’s the second most important component in the orchestration game – just getting the service up and running being the first.
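As a sense of how little friction those client libraries add, here’s a small sketch using the official Python prometheus_client. The metric names and port are illustrative; Prometheus (configured directly or via service discovery) scrapes the /metrics endpoint this exposes:

```python
# Sketch of instrumenting a service with prometheus_client; names are made up.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total", "Requests handled", ["endpoint"])
LATENCY = Histogram("myapp_request_seconds", "Request latency in seconds")

@LATENCY.time()                      # record how long each call takes
def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)          # exposes /metrics on port 8000
    while True:
        handle_request("/api/demo")
```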

This is taking the datacenter API concept and scaling it down so that it can run on a laptop for the developer. Prometheus, Kubernetes or Docker, etc. can all be run locally, so you can instrument your application while you’re writing it and get real-time feedback on what it’s doing while you develop. Rather than just deploying your code, you can deploy your code next to an ELK stack configured for your services, and now with Prometheus monitoring as well.

This is the sort of functionality that I expect a Platform as a Service (Heroku, Cloud Foundry, OpenShift, etc.) to provide: clear APIs and functionality to capture metrics and logs from my running application, to the point that you could create integration testing bound into these services. In my ideal world, these tools could capture and benchmark resource utilization during functional integration and regression tests, annotate the pass/fail markers, and actually provide, you know, metrics against which you can optimize and base real development decisions (which are also business decisions if you’re doing it right).

The closer we can move the detail needed to run an application “in production” to the developer, the bigger the win – all the details of testing, validation, and operational performance need to be at hand, as well as a clear means of using them.