Data Center API – self-service monitoring

When I was Director of Cloud and Automation Services at Disney, the group I reported into was a Disney-specific Internet Service Provider for the other business units (and there were a LOT of them). The rough goal of that position was “making it possible to run more things for business units, cheaper, faster, better”. It was the group's first stab at offering self-service, cloud-like capabilities to the rest of the organization, and it ultimately led me onward to OpenStack, Nebula, and other interesting technical challenges.

One of the early challenges (and it is still a challenge today) was applying operational best practices to running services when doing it via “self-service”. The classic ops model (which I suspect is still there today, just mixed in with other options) used the “big guys” – HP OpenView, BMC, Tivoli, etc. What we wanted to do was enable a portion of what I was internally calling our “Data Center API” for the business unit developers. The goal was to make it possible for a developer to set up monitoring for their application without having to open a service desk ticket, which then led to some phone or email conversation nailing down the specifics: what should be monitored, what thresholds applied, or even the more complex areas beyond simple metrics and thresholds. In short, simplify it all down and get the human process out of the way.

The licensing models for all of those products were (and still are) complete shit for self-service interactions and growth. They were also all configuration driven (not API driven), and some of them only allowed those configurations to be updated through a graphical user interface. Most of my vitriol was reserved for the ones that required Win32 applications to update the tools' configurations. We ended up doing some development around a cheaper option with a site-licensed model so that we didn’t have the incremental cost growth issues. In the end, the first cut was as simple as “we’ll ping test the VM when we set it up for you, and you can have us verify one or more arbitrary HTTP URLs are active and send alerts (emails, in that case) if they aren’t”. It was all imperatively driven by the request to create the VM, and editable after the fact through a web interface exposed to the developer. Not bad, really – although I wanted to make it a lot more.
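
That first cut – verify some HTTP URLs, send an email when they fail – is only a few lines of logic. A minimal sketch with the standard library; the URLs and the alert-decision helper are invented stand-ins, not the actual Disney-era tooling:

```python
import urllib.request
import urllib.error


def check_url(url, timeout=5):
    """Return True if the URL answers with an HTTP 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def alerts_for(results):
    """Given {url: is_up} results, return the URLs worth alerting on."""
    return sorted(url for url, up in results.items() if not up)


# In the real service, failing URLs fed an email alert; this sketch
# just computes which URLs need one.
results = {"http://example.com/health": True, "http://example.com/api": False}
failing = alerts_for(results)
```

Everything past this – thresholds, escalation, the email itself – is where the human-process complexity used to creep back in.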

Fast forward to today, some five years later, with microservice deployments exploding the scale footprint and diversity of monitoring – both metrics data and logs. PaaS offerings like Heroku’s add-on marketplace, along with AWS feature gaps, gave services like Splunk, DataDog, New Relic, and DynaTrace a chance to exist, grow, and evolve into powerful tool sets. The open source/DIY side also exploded, with the ELK stack, Graphite, Nagios, InfluxData, and Zenoss all mixing open source and commercially supported projects. Most recently, tooling like Prometheus, Grafana, and InfluxDB, and James Turnbull’s book The Art of Monitoring (including Riemann), have been grabbing a lot of attention – very deservedly so.

What caught my eye (and incited this post) is Prometheus’s advances, now released as 1.0 (although they’ve been pretty damn stable for a while). They followed the same keep-it-simple poll/scraping path that we did back at Disney, and have some blog posts related to scaling in that configuration. I have personally come to favor direct event-driven reactions to reduce the cascade of added latencies and race conditions that you get with polling, but for the trailing-response sort of monitoring mechanisms, you can’t argue with their success and ability to scale it.
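
The poll/scrape model is simple enough to sketch with nothing but the standard library: the application exposes its counters on an HTTP endpoint in Prometheus's plain-text exposition format, and the monitoring server comes around and scrapes it on its own schedule. The metric names here are illustrative, not from any real service, and a real application would use the official client library rather than hand-rolling this:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counters the application would increment as it does work.
METRICS = {"http_requests_total": 12, "errors_total": 1}


def render_metrics(metrics):
    """Render counters in Prometheus's text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the scrapes quiet
        pass


def scrape_once():
    """Stand in for the Prometheus server: poll /metrics one time."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as resp:
        text = resp.read().decode()
    server.shutdown()
    return text
```

The important property is that the application only ever answers questions; all the scheduling, retention, and alerting lives on the scraping side.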

Most interestingly to me, they are implicitly encouraging a change in the scope of the problem. While they can scale to the “Missouri River in ’93” floods of data that monitoring an entire datacenter can deliver, they’ve defaulted to scoping down to a complex application/service made up of microservices. One per, thank you very much. Prometheus solved the setup and configuration problem by embedding into the Kubernetes service discovery mechanisms: as Kubernetes changes its pods, data collection keeps rolling as individual services are added and removed. I suspect comparable orchestration technologies like Docker Swarm, Terraform, and Marathon have similar mechanisms, and AWS CloudFormation has included simple monitoring/alerting for ages within its basic API.

It still means another API to write to and work with – alert templates and such – but the client libraries that are available are the real win: they exist for any of the major languages you might use in even the most diverse polyglot development environments. It’s the second most important component in the orchestration game – just getting the service up and running being the first.

This is taking a datacenter API concept and scaling it down so that it can run on a laptop for the developer. Prometheus, Kubernetes, Docker, etc. can all be run locally, so you can instrument your application while you’re writing it and get real-time feedback on what it's doing, and how, while you develop. Rather than just deploying your code, you can deploy your code next to an ELK stack configured for your services, and now monitor with Prometheus as well.

This is the sort of functionality that I expect a Platform as a Service (Heroku, CloudFoundry, OpenShift, etc) to provide: clear APIs and functionality to capture metrics and logs from my running application, to the point that you could create integration testing bound into these services. In my ideal world, these tools can capture and benchmark resource utilization during functional integration and regression tests, to annotate the pass/fail markers and actually provide, you know, metrics against which you can optimize and base real development decisions (which are also business decisions if you’re doing it right).
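
A small sketch of what that annotation could look like: wrap each integration test so its runtime and peak memory ride along with the pass/fail marker. The test function and the shape of the report are invented for illustration; a real harness would hang this off pytest or similar:

```python
import time
import tracemalloc


def run_with_metrics(test_fn):
    """Run a test callable and annotate its result with resource metrics."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        test_fn()
        passed = True
    except AssertionError:
        passed = False
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"passed": passed, "seconds": elapsed, "peak_bytes": peak_bytes}


# A hypothetical regression test: build a structure, check an invariant.
def test_builds_index():
    index = {n: n * n for n in range(10_000)}
    assert index[99] == 9801


report = run_with_metrics(test_builds_index)
```

Once every test run emits numbers like these, "did we get slower this release" becomes a query instead of an argument.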

The closer we can move the detail needed to run an application “in production” to the developer, the bigger the win – all the details of testing, validation, and operational performance need to be at hand, as well as a clear means of using them.

AWS expands to an IDE in the cloud

I missed the announcement a few weeks ago: Cloud9 got scooped up by Amazon in mid-July. I used Cloud9 on and off over the past two years, primarily as a free service. While it has its definite limits, it’s an amazing piece of technology. In the realm of “OMG, look what you can do in a browser now!”, a functional IDE is as far out there as FPS video games. The service uses all the advances in the browser/JavaScript engine, plus some impressive coordination with container-based technology, to make an excellent IDE experience in your browser. Then they step beyond plain browser technology by providing the ability to have more than one person working on the same files in the IDE at the same time. Just like two people can work in a Google Doc together, you can collaborate on code in the same fashion.

In my experiments, I was highly focused on JavaScript and some Python – strong spots for Cloud9. As IDEs go, it was decent but not as full-powered as Visual Studio, Xcode, or Eclipse. More like a beefed-up text editor – like Atom – but really several steps beyond that. It’s a little hard to characterize.

The reasons it caught my eye were twofold:

  • I could collaborate with another remote person on code in a mechanism like pair programming.
  • My development environment was no longer pinned to my laptop, and even better – was available on any device where I sat down.

I had really achieved the second bullet some time before, but by utilizing remote terminals, Linux shells and vim enhancements, and tooling to set up a VM or container the way I wanted. I maintained the equivalent of a Vagrantfile for creating an OpenStack instance with exactly my development environment for ages. The downside was the loss of the graphical environment – most of which I didn’t mourn too much, but there’s a benefit to an effective UI.

Cloud9 leveraged GitHub accounts (today’s de facto “developer identity”) to connect, and offered a fairly simple pricing scheme if you were bringing a team to it. Codenvy appears to be doing something similar, but with the usual “enterprise spin” and greater integrations into enterprise mechanisms. I personally enjoyed using Coda for iOS when I was seeing what options worked best for development while on an iPad Pro with a keyboard. Coda, along with Prompt, gave me great access to any cloud resources, with the speed/responsiveness benefit of a truly local editor.

The IDE setup for C9 is still available. In fact, you can see my experimental workspace with the core libraries from the RackHD project, last updated about 8 months ago. The underlying systems support far more than JavaScript these days – Ruby, Python, PHP, C++ – really, anything you can develop on Linux. Their IDE language tooling was focused on JavaScript last I played with it (8+ months ago), with burgeoning support for other languages. I haven’t used it recently enough to assess what’s useful today, but it likely covers even more.

A few of my coworkers have taken to setting up preconfigured containers for their development environment, leveraging Docker to make their development tooling consistent while still using local editors to manipulate files. They’re doing all their development in Go. There’s a tremendous number of positive things to say about Go, but how to set up and use its tooling across different projects and dependencies isn’t among them. In that scenario, a Docker-based tooling environment was a joy. Wish they’d ditch the Makefiles though; I’ve always hated them – regardless of how effective they are.

The big question in my head is “What will Amazon do with Cloud9?” There is some supposition that Amazon used it to pull focus away from Google, but Cloud9 also had ties/integrations into Salesforce and Heroku. I hope it wasn’t a “we’re unsustainable, where the hell can we go to pay out our creditors” fire sale. Amazon has toyed for ages with how to best apply reasonable and effective UIs (and UX in general) over their services. They suck at it, to be honest. Cryptic and arcane – but if you know it and don’t mind a lot of confusion and cheat sheets, you can make it work. Not unlike Windows 3.1 back in the day.

Anyway, hopefully this marks a point of infusion of UI/UX sanity into AWS service efforts. They need it.

paint and sandboxes in 3D and VR

I had a singular joy last night: watching my love “paint” in 3D using the HTC Vive VR goggles and a couple of wands. Standing nearby meant being likely to get poked, but the joy and wonder she expressed while reaching through the air was truly wonderful. Inspiring, actually.

With the work I do today, few folks will ever know about it, let alone pick it up and get such an intensely enjoyable moment from it. I’ve described myself as a “digital plumber” several times in the past, and I think that generally still applies. The art and science of building, running, and debugging distributed systems for a variety of purposes, mentoring teams in software engineering, bridging the gap between need/vision and the reality of what can be done, and sometimes coaching ancillary professions around the edges. I get tremendous satisfaction from writing the code I write, seeing it work, and knowing it’s helping people, albeit most of them are other digital plumbers. That same helping of people is why I’m involved with open source on several levels, have been in the past in a variety of communities, and I’m sure will be in the future in other communities. I play a team game, and revel in “small groups getting impressive stuff done”.

The VR goggle experience last night stood out against that. Or more particularly, the expressions of joy and wonder while experiencing it, playing with it. While I’ve worked for media companies, some really big ones, I was never directly involved in anything leading toward the creative content, or even the tooling in support of it. I’ve got to admit, after last night’s session I am feeling the lure.

I’ve no idea what I’ll do with it. I’ve been looking forward to seeing more of VR and what’s available, tracking the news on hardware, software tooling and engines, demos that are attracting attention, and of course the big players of the tech – HTC, Oculus, Sony, and Microsoft. I think anything in this space today will be more like a movie or a AAA game in terms of the spend needed to really assemble a solid experience, but there are enough pieces and parts to start playing with it in a “desktop publishing” fashion – limitations galore, but still something that can be expressed and potentially shared. We’ll have to see what lies therein…

The evolving world of the server-side developer

Over the past couple of weeks, I’ve talked with a number of folks about Kubernetes, Marathon, Docker Compose, AWS CloudFormation, Terraform, and the bevy of innovation that’s been playing out over the past 12-18 months as Docker swims across the ecosystem.  This is at the top of a lot of people’s minds – how to make applications easier to run, easier to manage themselves, and how to do that consistently. I wrote a little about this whole set a few weeks ago. I thought it would be worthwhile to put down my view of the changing world of a server-side software developer.


All of these technologies are focused on making it easier to run sets of applications (with the in-vogue term being microservices) and keep them running. The significant change is containers evolving from the world of virtual machines. The two are not mutually exclusive – as you can easily run containers in VMs, but fundamentally containers are doing to virtual machines what virtual machines did to physical x86 servers – they’re changing up the management domain – encapsulating the specific thing you want to run, giving it a solid structure, and (mostly) making it easy.

AWS took virtual machines and accelerated the concept of dividing up physical hardware into smaller pieces. Spin up, tear down, API driven, measured and controlled the crap out of it – and made The Cloud. That change, that speed, has shifted our expectations on what it takes to get the infrastructure in place so we can produce software. When we get to basics, everything we’re doing is about running an application: doing what makes it easier, faster, cheaper, more secure, more reliable. Microsoft Azure, AWS, and Google Compute are leading the race to drive that even further toward commoditization. And if you’re asking “what the hell does that mean” – it means economically cheaper and consistent for all practical purposes. The cost of running an application is more measurable than it has ever been before, and it shouldn’t surprise anyone that businesses, let alone technologists, would encourage us to optimize against cost and speed.

For software delivery, virtual machines mostly followed the pattern of physical machines for deploying an OS, using it for patches and updates, and leveraging the whole idea of OS packages as a means of deploying software. Lots of companies still dropped some compressed bundle of code for their own software deployment (a zip, war, jar, or whatever). A few bothered to learn the truly arcane craft of making OS-level packages, but a pretty minimal set of folks ever really integrated that into their build pipelines. The whole world of configuration management tools (CFEngine, Puppet, and Chef) erupted in this space to keep this explosive growth of “OS instances” – virtual servers – in control.

Now containers exist, and have formalized a way of layering up “just what we need” and to some extent “exactly what we want”. Whether it’s a compiled binary or an interpreter and code, containers let you pull it all together, lock it into a consistent structure, and run it easily. On top of that speed, you’re removing a lot of cruft with containers – it’s one of the most brilliant benefits: trimming down to just what you need to run the application. In a world where we can now consistently and clearly get numbers for “what it costs to run … on AWS”, it’s an efficiency we can measure, and we do. Virtual machines were heading this way (slowly) with efforts like JeOS (just enough OS), but Docker and friends got into that footrace and left the JeOS efforts standing still.

In all this speed, having consistent ways to describe sets of processes working together – their connections and dependencies – and being able to set that all up quickly, repeatedly, and reliably is a powerful tool. In the VM world, that is what AWS CloudFormation does – or Terraform, or Juju, or even BOSH or SaltStack, if you squint at them. Those all work with VMs – the basic unit being a virtual machine, with an OS and processes getting installed. In the container world, the same stuff is happening with Kubernetes, Marathon, or Docker Compose. And in the container world, a developer can mostly do all that on their laptop…
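
At the core, what all those tools provide can be sketched as data plus ordering: declare the processes and who depends on whom, then derive the startup order from the declaration instead of from tribal knowledge. A minimal sketch with invented service names, using the standard library's topological sorter:

```python
from graphlib import TopologicalSorter

# A declarative, Compose-style description of the application:
# each service lists the services it depends on.
SERVICES = {
    "db": set(),
    "cache": set(),
    "api": {"db", "cache"},
    "web": {"api"},
}


def startup_order(services):
    """Return a start order where every dependency comes up first."""
    return list(TopologicalSorter(services).static_order())


order = startup_order(SERVICES)
```

CloudFormation, Kubernetes, and Compose each add health checks, retries, and placement on top, but the dependency graph is the heart of the declaration.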


Docker scales down extremely well. Developers are reaping this benefit, since you can “run the world” on your laptop. Where a developer might have needed a whole virtual infrastructure (e.g. an ESXi cluster, or a cloud account), the slimness of containers often means we can pare away all the cruft and get down to just the applications – and that slimming is often sufficient to run everything on a laptop. The punchline “your datacenter on a laptop” isn’t entirely insane. With everything at the developer’s fingertips, overall development productivity increases, quality can be more easily verified, etc. The win is all about the cycle time from writing code to seeing it work.

Another benefit is the specificity of the connections between processes. I can’t tell you the number of times in the VM- or physical-based world where I was part of a review chain, trying to drag out what processes connected to what, what firewall holes needed to be punched, etc. We drew massive Visio diagrams, reviewed them, and spent a lot of hours bitching about having to update the things. Docker Compose, Kubernetes, etc. include all this information up front. Where we created those Visio diagrams, they’re now part of the code. Imagine a dynamic Visio-like diagram that shows a Docker Compose application up and running, the state of each process, and the status/utilization of all the connections. Like the fire-control stations on a ship, the whole thing diagrammed – sections lighting up for clarity of what’s working, and what isn’t. We are steps away from that concept.


Upgrading is now a whole different beast. Containers are pretty locked down, and because of their overlaid filesystems, an update to an underlying layer generally means deploying a whole new container. The same path applies whether it’s a code update or a security patch to your SSL libraries. How your tooling supports this is a big deal, and most of these systems have a pretty clear means of doing what’s typically called a blue/green deployment to solve this problem. It remains to be seen what we do for tracking security updates and CVEs that apply to underlying libraries, and how we patch and update our code as we find the issues. Up until a few months ago, there weren’t even initial solutions to help developers do this.
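
Blue/green itself is a simple idea: stand up the new container set alongside the old, check its health, and only then flip traffic over, keeping the old set around for rollback. A toy sketch of that control flow, with the router structure and health check invented for illustration:

```python
def blue_green_deploy(router, new_version, health_check):
    """Deploy new_version to the standby slot; flip traffic only if healthy."""
    standby = "green" if router["live"] == "blue" else "blue"
    router[standby] = new_version
    if not health_check(new_version):
        return False  # traffic stays on the old set; users see no change
    router["live"] = standby  # atomic cut-over; old set remains for rollback
    return True


router = {"live": "blue", "blue": "v1", "green": None}
ok = blue_green_deploy(router, "v2", health_check=lambda version: True)
```

The frameworks differ mainly in how "flip traffic" and "check health" are implemented – load balancer swaps, rolling label changes, readiness probes – but the shape is this.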

Containers also allow the environment to scale. Add more hardware, use more resources – the application not only scales down, it scales up. Apple, Facebook, Microsoft, Google, Netflix – and frankly a ton of companies you’re maybe not even aware of – all have applications that span many machines. With the connections between containers an integral part of the system, that scaling process is no longer unclear, and the world of scheduling and multi-container management is where some of the most interesting innovation is happening today. Applications can go from running on a laptop to scaling across multiple datacenters. The “how” – what tools are used and how the logistical problems are solved – remains open. It’s a hell of a challenge, and there are probably some challenges we haven’t even hit yet that we’ll find are critical to solving these problems.

the roads forward

These last two items – scaling and upgrading/updates – are the two places where the frameworks will distinguish themselves. How they help developers, what they do, what limits they place, and how well they work will drive the next generation of software development for “server side” developers. All of this is independent of the language you’re writing in. Java, Go, JavaScript, Python, Erlang, Ruby – it all applies the same.

There is a whole new set of problems that will emerge with distributed systems being the norm. Emergent behaviors and common patterns still have to be established and understood; there’s a whole world of “new to many” that we’ll be working in for quite a while. Like the crazy-interesting results that Rule 30 delivers in cellular automata.



Will AWS Chalice be the straw that broke Heroku?

A friend of mine sent me a link about a “serverless Python microframework” called Chalice. Chalice is an offering from AWS Labs, a public experiment of sorts. As I watched the YouTube video, I thought “damn, AWS finally stepped into the PaaS world”.

Using Chalice is incredibly straightforward. If you’re familiar with Python development, then the slight variation on Flask will be almost immediately understood. Chalice itself settles into the niches within the plethora of AWS functionality, effectively giving a developer a command-line develop, deploy, and verify loop – all on the cloud stack, and without having to mess with much configuration: exactly the win that PaaS provides. It is AWS’s first serious step into the space where Google App Engine, Heroku, CloudFoundry, and OpenShift have been slowly building up and competing.
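
The hello-world shows how Flask-like it is. This is the shape of a Chalice app as I recall it from the project's docs – treat the details as approximate rather than authoritative:

```python
from chalice import Chalice

app = Chalice(app_name="helloworld")


@app.route("/")
def index():
    # Returned dicts are serialized to JSON on the way out.
    return {"hello": "world"}
```

From there, `chalice deploy` pushes it up as a Lambda function fronted by API Gateway – no servers to configure, which is the whole point.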

I look at Heroku (which runs on AWS) and how they developed their business model. Heroku will be sticky for a while, but I’ve got to imagine that if AWS wants to play that game and grab that segment of the hosted PaaS market, Heroku is going to be working double time to differentiate itself and keep its market. Google has its infrastructure to back it up, and frankly was first into this game. CloudFoundry and OpenShift aren’t fighting for the hosted market, but the enterprise marketplace. Heroku, however, is a bit more out on a limb.

Heroku could play the game of better user experience. Without a doubt it is, already. It could bolster that with feature capabilities as well – more integrated solutions to drive out that “easier to develop and debug” space. But there is a problem there – where they could have expanded internally, they have already built a marketplace. New Relic, DataDog, and a flurry of other hosted solutions have all refined themselves in that marketplace, providing impressive services that will be hard to replicate. That is going to make it a lot harder for Heroku to use this for more “built-in” features to differentiate.

I’m sure this is just the opening shot in the coming marketplace fight for hosted PaaS. Chalice is barely out of its diapers, but you can easily see where it can grow. The gang at Salesforce has a big fight on their hands…

How To: Debugging and Isolation

There’s a number of skills that I think are critical to being a good engineer. Although my 20-year-old self would have scoffed at the idea, “clear, correct and concise communication” is probably the most important one. As a team gets larger, this gets harder – for everyone. Sharing ideas and knowledge, communicating across groups that do things differently, or just even have a different focus, can be, and often is, a challenge.

One of the places I see this is in the boundary between folks who’ve written or created some code (and implicitly know how some or all of it works) and folks who are trying to use the same – but without the knowledge. They’re learning, trying things out, getting confused, often asking for help.

A conversation often solves that pain point, but someone needs to have the knowledge to have that conversation. Or maybe it’s not working the way you expected, and you’re now in the state where you need to figure out how someone else got a surprising result. You need a way to do this learning. That is where debugging and isolation come in: they are techniques for exactly that.

Here’s the dirty little secret: almost nobody – developers and creators of the systems included – keeps perfect knowledge of how a system works in their heads. Everyone uses this technique, even if they don’t call it as such. Knowing it, knowing about it, and using it intentionally is a skill every engineer should have.

In my head, I call this “the game of figuring out how this thing works”. It’s a kind of game I’ve loved since I was a kid, which probably goes a long way to explaining me. Truth is, we all do this as kids, although wrapping names and a formal process around it isn’t something most folks do. Point is, you already know how to do this – it’s just a matter of practice to get better at applying it.

Playing this game is simple, here’s how it works:

  • describe the thing you’re trying to figure out
  • make a guess (create a hypothesis) about “if I do XXX, then YYY will happen”
  • try it and see!
    • if it didn’t work, make another guess
    • if it did work, add that to your knowledge
  • rinse and repeat, cook until done, etc

This first sequence is the portion called debugging. It works, but getting to a full description of everything you can do, and every reaction that could come out, can be a time-consuming affair. What you’re doing in your head is building up a model of how this works – in essence, you’re reverse engineering to figure it out. And yeah, if you clued in that this is exactly the same thing as the “scientific method”, you’re on the right track.

Note that I called it “simple”, not “easy”. This particular game can be very frustrating, just as it can be incredibly rewarding. There are doctoral theses and research papers on ways to do this better, specific techniques to use in specific situations, etc. In the software engineering world, there’s also tooling to help get you the information you’re after in an easily consumable form. But that’s not always there, and things like tooling aren’t distributed evenly. Recognize that this can be hard and difficult. It is worth calling that out, because it frequently seems like anything simple should also be easy.

In the debugging I described at the top of this article, you’re doing this as a precursor to figuring out how to make something work the way you expected it to, instead of the way it is currently working. Or maybe you’re trying to reproduce what someone else saw, but which doesn’t happen for you. Whatever the specific goal you are trying to achieve, the basic process is the same.

So what’s this isolation? Isolation is a way to make debugging easier and faster, and it makes figuring out problems generally a bit easier. The whole idea behind isolation is that you can often take a problem and break it down into subproblems that can be solved independently. Or said another way, most “things” are made up of components that are assembled together, and you can often use this structure to reduce the scope of what you’re guessing about.
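
One mechanical form of isolation worth having in your toolbox is bisection – the trick behind `git bisect`: when an ordered pile of steps (commits, config changes, pipeline stages) produces a bad result, test the halfway point, throw away the half that's clean, and repeat. A toy sketch, with the steps and the badness test both invented, and assuming that once badness appears it stays:

```python
def first_bad_step(steps, is_bad):
    """Binary-search an ordered list of steps for the first one that makes
    is_bad(prefix_of_steps) True. Assumes is_bad(all steps) is True and
    badness, once introduced, persists - the same assumption git bisect makes."""
    lo, hi = 0, len(steps)  # invariant: prefix of lo is good, prefix of hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(steps[:mid]):
            hi = mid
        else:
            lo = mid
    return hi - 1  # index of the step that first introduced the badness


# Toy example: applying step 5 is what breaks the system.
steps = list(range(10))
culprit = first_bad_step(steps, lambda prefix: 5 in prefix)
```

Ten steps take about four guesses instead of ten; a thousand take about ten. That's the payoff of cutting the guessing scope in half each round.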

Most things we build are like this, and we tend to naturally think about things in hierarchies and layers. Sometimes there is the trick of even figuring out what something is made of (that would be why we’re smashing atoms together: to break them apart and figure out what they’re made of), but when you’re working on figuring out a software problem you have the benefit of some nice clean boundaries on that problem. Well, hopefully clean boundaries. Often it’s written down, or you can ask someone, or if it’s open source, you can read the code.

I go back and forth between isolation and debugging techniques. I start off making sure that I understand how something is put together by coming up with my tests and verifying them, and when something doesn’t work as I expected, then I break it down into the smallest piece of the puzzle that isn’t working the way I expected.

Keep a piece of scrap paper around while you’re doing this. It’s my cheap version of a scientist’s lab notebook. I personally love butcher paper and a few colored pens, or a good whiteboard – but I mostly make do with a legal notepad and whatever pen happens to be at hand. Draw the diagrams of what you think the system is broken down into. Then draw how the pieces are interacting. Don’t try to keep it all in your head. Just the act of drawing it out and scribbling on a piece of paper can help organize your own mental model of what is happening and how things are connected.

There is a side bonus to scribbling down notes and diagrams as well: when it comes time to communicate with someone else, you have something you can share. I’m a very visual person – I like to describe with diagrams, and sometimes a prodigious amount of hand waving – showing movement, focus of conversation, etc. In general, sharing a visual representation of what you’re talking about with others can be a great way of helping explain what you’ve learned to someone else.

There are plenty of formal ways to do this drawing (UML diagrams such as layer diagrams, sequence diagrams, etc.). While you are doing this for yourself, draw in a way that is meaningful to you. The notes are for you first and foremost – if you want to share the information, you can clean it up after the fact. Bubbles with lines connecting them is a style I like to use – scrawling graphs if need be. Sometimes layers and boxes make more sense. And most of the time I’m doing this sort of thing, it’s the interactions between components – their sequences and what happens next – that you’re diagramming and trying to figure out (this is where different colored pens come in handy; plus, who doesn’t like lots of colors in the diagram!)

At the end of this process, you are probably trying to share this detail with someone else. I would encourage you to use the adage that was drilled into me by physics professors in college: show your work. There is nothing wrong with taking a photo of the notebook page where you figured out where something wasn’t working as you expected, and including it with writing describing what you found, and what you’re after.

Most systems these days make it easy to include pictures as well as text – bug reports and pull requests in GitHub, for example. I have a set of employees who just absolutely adore putting this kind of thing in PowerPoint. Honestly, whatever works and can be shared with someone else is good. There is a whole set of skills around communicating clearly, but just getting the process started is more than half the battle.

Leadership skill: moderation

Coffee conversation threaded through a number of topics this morning, mostly flitting around diversity and how it remains a continued challenge – in technology fields especially, but really it’s everywhere. A post on Medium that has stuck with me is Anne Halsall’s When Women Engineer. If you haven’t seen it, take a moment and give it a solid read-through. It is an excellent post looking at the crossover of sexism and how most companies have a very male-oriented bias in how they interpret, well, everything.

When I first read Anne’s post, one of my first thoughts was “Hey, some of what you’re seeing isn’t just male sexism, it’s corporate-standard stupidity in dealing with different personalities”. There’s some of that in there, and Anne reiterated that it wasn’t just a failure to deal with different personalities.

One of the skills critical to leadership is being an effective moderator. The reasoning applies to some of what Anne saw, but extends through different personalities and, as I’ve concretely learned over the past 18 months, is an excellent tool to help overcome significant cultural differences. “Overcoming” meaning communicating clearly – not forcing a change in the interaction styles that come from different cultures.

Moderating is not just about making sure everyone obeys some basic politeness rules – I think my best metaphor right now is that it’s like working a complex sound engineering board where each of the people in the conversation is one of the inputs. Some are darned quiet, some are loud – and leveling those out as best as possible is an obvious part of the effort. But there’s also the communication that may come across as garbled. What may not be apparent as a need is being not only willing, but proactive, in asking clarifying questions or otherwise changing the flow of the conversation to get out a full understanding. The cues for this can be really darned subtle – a raised eyebrow, furrowed brows, or other more whole-body expressions of surprise – or the tone of the conversation a few minutes later.

A moderator has implicit (sometimes explicit) control of the conversation, and establishing that up front – often as simple as saying “I’ll be moderating this meeting” – is important. You can’t be an effective moderator without it, and there are definitely times when it’s forgotten. You may need to wait out a particularly long-winded or excessively vocal individual. I personally try not to engage in the “shout them down” tactic, but I’ll admit to having used it too. Honestly, when I’ve had to do that, I figured that I’d screwed up some better method – it’s just so… inelegant.

There is also a bit of a social contract that you are agreeing to when you’re moderating: that you won’t be injecting your own hidden agenda into the conversation. That is, you won’t be using the conversational control to your own “evil” ends. Hidden is the critical word here – all conversations have a purpose, so making that purpose explicit up front – calling it out before any in-depth conversation has happened – is a good way of getting it into everyone’s heads. From there, it’s paying attention to the conversation flow, the individuals, and guiding it to try and achieve those ends. You may have to reiterate that purpose during the conversation – that’s OK. Plenty of good stuff is found in the eddies off the main flow – don’t stop it entirely, but have a sense of when it’s time to get back on track.

You might have read that good meetings all have agendas. I’m generally in agreement, and one of the formulas I try to use is starting off any conversation with the agenda and goals. That helps in immediately setting the stage for moderating the conversation: in doing so, you have implicitly asserted that you’re paying attention to that purpose and goal. In the “conversation is a river” metaphor, you stepped up to the rudder and said you’d pilot the boat.

This applies all over the place in discussions – from a scrum daily standup meeting (which is nicely formulaic, even though I tend to repeat what I’m looking for in the status update), to design brainstorming, to technical discussions of “the best way to thread that Möbius needle”.

One of the characteristics that a moderator has to have (or develop) is the willingness to engage in what may be perceived as conflict. That is, you have to be willing to step in, contradict, and stop or redirect a conversation. Growing up, I was a pretty conflict-averse person – so much so that I’d let people walk over me, and walk over a conversation. I had to really work on that: to be willing to step into a conversation, to signal with body language or conversational gambits that we needed to stop and/or re-route the flow of conversation. And yes, you’ll hit those times when all of those mechanisms fail – when emotion has gotten so heated that someone is just raving on – and the only thing you can do is stop the conversation entirely. It may even come across as terrifically rude, but the best thing you can do is get up and step away from the conversation. Sometimes that means physically walking out of the room. Another choice is to let the individual who’s really going on just exhaust themselves, but recognize that the conversation may be best set aside for the time being, or the problem/agenda may need to be reframed entirely.

As a side note, engineers – often male engineers – can be notoriously obtuse or just outright ignorant of body language cues in conversation. Most of the time I think it is non-malicious, but there will be people who intentionally ignore body or even verbal cues in order to continue their point or topic, or to ignore or override others trying to be involved.

A skill I have personally focused on developing (and which I recommend) is the ability to take notes while moderating a discussion. It may sound to you like “How on earth would you have TIME to take notes as well as moderate?!”. The answer: make the time. I am willing to ask for a pause in the conversation while I’m taking notes when I’ve fallen a bit behind, and writing down the perspectives and summaries of what people are saying helps me externalize the content of what has been said. It actually makes it clearer to me, as the very process forces me to summarize and play back what I thought I heard. I’ve found more instances of garbled communication by writing it down: when I heard it I thought I’d internalized it, but when I tried to write it down I realized that it wasn’t making sense. And then there is the side benefit of having written notes of the meeting, which I recommend saving – as not everyone will remember the conversation, and some may not have been a part of it.

To reiterate, moderating is a skill that I view as critical to good leadership. If you’re leading a team, formally or informally, think about how you can apply it. Think about it, and DO it. It’s one of the ways a leader can “get roadblocks out of the way”, and if you’re aspiring to lead teams, it’s something you’d do well to invest your time in learning.

Neural network in your pocket

One of the most interesting announcements at WWDC has mostly flown under the radar: an extension to the already impressive Accelerate framework called BNNS. Accelerate, if you don’t know, is a deeply (often hand) optimized low-level library for computation that’s set up to leverage the specifics of Apple platforms. It’s the “go-to” library for numerical computation, linear algebra, matrix manipulation, etc., when you need speed and efficiency.

BNNS stands for “Basic Neural Network Subroutines” and includes optimized routines for the key pieces of setting up and using convolutional neural networks. The intent of the library is to provide efficient primitives for putting together enough of a neural network to classify or recognize. Notably, it doesn’t include the functions you would use for training a neural network, although I’m sure it’s possible to manually add code to do exactly that.

When I heard about BNNS and watched WWDC Session 715, I immediately wondered what it would take to start with something in Keras, TensorFlow, or Theano: construct the network and do the training with those libraries, and then port the network structure and trained weights into a neural network built with BNNS routines. I suspect there’s a reasonably straightforward way to do it, but I haven’t dug it out yet.

Fortunately, recognition or inference is relatively “cheap” computationally – at least compared to training the network. So I expect there is good potential to train externally, load the trained network as data into an iOS application, and then do the recognition on the device. If you find examples of anyone doing and showing that, I would appreciate a comment about it here on the blog. Considering that training a large, diverse network can eat up the equivalent of a couple of racks of high-end-gaming-machine-quality GPU systems, training is not something I’d try to do on an iOS device.

While I think I understand most of the common pieces of building neural networks, I can’t really claim it yet, as I haven’t tried to teach it to someone else or used it “in anger” – when I needed to get something specific done NOW. Either of those is what usually has me hitting the subtle roadblocks and potholes that lead to “Ah-ha!” moments of deep understanding. This may be a good side project: pick a couple of easier targets and see where I can get to. Maybe grab MNIST digit recognition or something, see what I can train with Keras or TensorFlow, and then figure out how to translate and port that into classifiers for iOS using BNNS.
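To make the “inference is cheap” point concrete: once a network is trained, classification is just a forward pass over fixed weights – a handful of multiply-adds and activations per layer. Here’s a deliberately tiny, dependency-free sketch of that idea. The weights here are made up stand-ins; in a real port they’d be exported from a Keras/TensorFlow training run, and the inner loops would be Accelerate/BNNS calls rather than Python:

```python
# Sketch: inference as a plain forward pass over weights trained elsewhere.
# All values below are illustrative placeholders, not a real trained model.

def relu(v):
    """Standard rectified-linear activation, element-wise."""
    return [max(0.0, x) for x in v]

def dense(weights, bias, inputs):
    """One fully connected layer: each output is a dot product plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def classify(x, layers):
    """Run the forward pass and return the index of the strongest output."""
    for weights, bias, activation in layers:
        x = activation(dense(weights, bias, x))
    return max(range(len(x)), key=lambda i: x[i])

# Toy two-layer network with made-up weights, standing in for a trained model.
layers = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, 0.1], relu),       # hidden layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], lambda v: v),  # output layer
]

print(classify([1.0, 0.0], layers))  # → 0
print(classify([0.0, 1.0], layers))  # → 1
```

No gradients, no optimizer, no training loop – which is exactly why this half of the problem fits on a phone while the training half eats racks of GPUs.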


Marathon, Kubernetes, Docker Compose, Terraform, Puppet, Chef, Ansible, SaltStack, and Juju

Yesterday, I wrote about fully autonomic services. It’s been on my mind for months now – more accurately, years. To get to the end state of a self-managing service, there’s a ton of knowledge that needs to be encapsulated in some form. We tend to represent this knowledge in one of two forms – imperative and declarative – often built up in layers.

Programmers deal with this in their day to day jobs. They’re imparting knowledge, putting it into a form that systems (computer languages, libraries, and frameworks) can use. The expression of that knowledge, and how it relates to what we’re trying to do, is the art and science, the essential craftsmanship of programming.

Imperative and declarative weave back and forth in the expression of programming. While it’s really something of a “chicken and egg” argument, I’d say we often start with imperative forms of expression, especially while we’re exploring all the different ways we could solve the problems, and learning from that. As it starts to become common – as we “commoditize” or standardize on ways of doing this – we create new language (sometimes new computer languages!) to represent that learning: we turn it into a declarative form. Sometimes this is in the form of libraries or frameworks, sometimes it’s in the form of layers of software, encapsulating and simplifying.

To create fully autonomic systems, we need to capture information about how to do the things we want the service to do. The challenge in this space is twofold: how to capture that knowledge, and coming up with good ways to frame the problems. If you look at Docker Compose, Kubernetes, and Marathon, each presents a means of thinking about this problem. They define declarative mechanisms to describe the programs/services they’re managing. Marathon calls them ‘applications‘, Kubernetes calls a similar concept ‘deployments‘, and Compose uses ‘docker-compose‘ as that declaration, sort of avoiding a description like ‘application’ or ‘deployment’ altogether. Terraform calls the concept ‘configurations‘, and Juju calls it ‘charms‘. They’re all declarations of what to run and how those pieces relate to each other in order to provide some specific software service.
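To give a feel for what these declarations look like, here’s a minimal Marathon-style application definition, built as plain JSON. The field names follow Marathon’s app schema (id, instances, cpus, mem, container), but treat the specific values as illustrative placeholders – the point is that you declare the desired end state, not the steps to get there:

```python
import json

# Illustrative Marathon-style application declaration: you describe what
# should run and how many copies, and the scheduler owns making it so.
app = {
    "id": "/frontend/web",        # hierarchical app identifier
    "instances": 3,                # desired count, not a start command
    "cpus": 0.25,                  # resources per instance
    "mem": 128,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "nginx:1.11"},  # placeholder image
    },
}

# This is what you'd POST to the scheduler's apps endpoint.
print(json.dumps(app, indent=2))
```

Notice there’s no “install”, “configure”, or “restart” verb anywhere in it – that imperative knowledge lives inside the scheduler, which is exactly the commoditization of learning described above.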

One of the most notable differences in these systems is that Compose, Marathon, and Terraform all stop at the level of declaring the structures, letting plugins or “something else” orchestrate the coordination for the interesting tasks like a red/green upgrade or a rolling upgrade, where Kubernetes is taking a stab at including a means of doing just that within its domain of responsibility. The implication in the case of Kubernetes is that developers deploying services with it will learn (or know) how the system works and develop code within those constraints – in programmers’ terms, use it like a framework. Marathon, in contrast, expects a developer to tell it what to do – to use it more like a library. Kubernetes is far more opinionated in this respect, in large part betting on the knowledge its development community already has to prove out that it nailed down the right level of abstraction.

A notable difference between Marathon and Kubernetes on the one hand and Terraform and Compose on the other is that the former two include an expected responsibility to keep running and “keep an eye” on the ‘applications’/’deployments’ they’re responsible for.

Puppet, Chef, Ansible, and SaltStack are all focused on the world of configuring services within a single virtual (or physical) machine – the “install and configure” and “start it running”. The model they were built around stops its responsibility at the boundary of getting the virtual (or physical) machine set up, and doesn’t include the concept of handling a failure of the machine itself. Keep in mind that these systems were created long before cloud computing was a reality, when asking for another machine wasn’t a few seconds of work behind an API, but days, weeks, or months of work. In a general sense, what they made declarative was the pieces within a virtual machine; they didn’t expand to the realm of stitching a lot of virtual machines together.

For the container versus virtual machine divide: Yes, it’s possible to use Puppet, Chef, and Ansible to configure containers, but I’d easily argue that while you can also drive a screw into the wall with a hammer, that doesn’t make it a terrifically good idea or use of the relevant tools.

As a side note: SaltStack stands out in this space with the concept of a reactor, so it’s intentionally sticking around and listening to what’s happening from within the VM (or VMs) that it’s ‘responsible for’.

The stand-out challenge that I keep in my mind for this kind of automation is “How can it be used to easily install, configure, run, and keep running an instance of OpenStack on 3 to 50 physical machines?” It doesn’t have to be OpenStack, of course – and I don’t mean “an IaaS service”, but a significantly complex application with different rules and challenges for the different parts needed to run it. I spent two years at the now defunct Nebula doing exactly this, and it’s a challenge that none of these systems can completely solve by themselves. It’s why “OpenStack Fuel” exists as a project, encapsulating that knowledge that’s otherwise represented in Puppet declarations and orchestrated externally using something they created called “Railgun“.

Another side note: this particular challenge has been the source of several companies (Nebula, Piston, and Mirantis) and the failing of many enterprise installations of OpenStack, as just getting the bloody thing installed is a right pain in the ass.

Kubernetes, Marathon, Compose, and Terraform wouldn’t stand a chance at the ‘OpenStack’ challenge, primarily because they all expect an IaaS API to exist that can give them a “server” when they need it, or they work at the level of containers, where a container API (Docker, rkt, GCE, etc.) can spin up a container on request. The concept of the challenge is still useful there – but it probably needs another form for a real example. Take a look at Netflix’s architecture and their adoption of microservices for another seriously complex use case that’s a relevant touchstone. Every company that’s seriously providing web-based services has these same kinds of complexity and scale issues, and each service’s architecture is different. Making those architectures a simpler reality is what these systems are after.


That’ll teach me to write such a review the day before DockerCon. This morning, Docker’s responsibilities and capabilities changed. :-)

As of Docker 1.12, it looks like Docker Compose has been folded into the core, and Docker is heading to take on some of the same space as Marathon and Kubernetes, calling that feature “Docker Stacks and Distributed Application Bundles“. It’ll take me a bit to review the new material to see where this really lands, but I’m not at all surprised that Docker is reaching their essential product responsibilities into this area.



Fully Autonomic Services

Fifteen years ago, IBM introduced a term into the world of computing that’s stuck with me: Autonomic Computing. The concept fascinated me – first at the most basic level, simple recovery scenarios and how to write more robust services. That interest led to digging around in the technologies that enable seamless failover and in more recent years into distributed systems and managing those systems – quorum and consensus protocols, how to develop both with and for them, and the technologies that have quite a bit of attention in some circles – managing sets of VMs or containers – to provide comprehensive services.

On a walk this morning, I was reflecting on those interests and how they have all been on a path to fully autonomic computing – a goal of self-managing services: services with human interfaces that require far less deep technical knowledge to make their capabilities available. Often that deep knowledge was me, so in some respects I’ve been trying to program myself out of a job for the past 15 years.

Ten years ago, many of my peers were focused on CFEngine, Puppet, and later Chef: “software configuration management systems”. Data centers and NOCs were often looking for people “skilled with ITIL” and knowledgeable in “effective change management controls”, with an intrinsic goal of being the humans to provide the capabilities that autonomic services were aimed at providing. The technology has continued to advance – plenty of proprietary solutions I generally won’t detail, and quite a number of open source technologies that I will.

Juju first caught my attention, both with its horrifically stupid name and its powerful capabilities. It went beyond a single machine to represent the combinations of applications and back-ends that make up a holistic service, and to set it up and run it. It was very Canonical, but also open source. More recently, SaltStack and Terraform continued this pattern with VMs as the base content – the unit of distribution – leveraging the rise of cloud computing. Many years before this, the atomic unit of delivery was an OS package, or maybe a tarball, JAR, or later WAR file – all super specific to the implementation of a particular OS or, in the case of JAR/WAR, language. Cloud services finally brought the compute server (VM) into the common vernacular as a commodity, disposable resource, and Docker popularized taking that “down a step” to containers as the unit of deployment.

Marathon and Kubernetes are now providing service orchestration for containers, and while I personally use VMs most commonly, I think containers may be the better path forward, simply because I expect them to be cheaper in the long run. The cloud providers have been in this arena for a while – HEAT in OpenStack as the obvious clone of Amazon CloudFormation, and a variety of startups and orchestration technologies that solve some of the point problems around the same space, and the whole hype-ball of “serverless”, leveraging small bits of computing responding to events as an even greater level of possible efficiency.

Moving this onto physical devices that install into a small office, or even a home, is a tricky game, and the generalized technology isn’t there, although there are some clear wins. This is what Nutanix excels at – small cloud-services-in-a-box. Limited options, easy to set up, seamless to use, commodity price point.

Five years ago I was looking at this problem through the lens of an enterprise “service provider” – what many IT/web-ops shops have become: small internal service providers to their companies, frankly competing on some level with Amazon, Google, and Azure. I was still looking at the space in terms of “What would be an amazing ‘datacenter’ API that developers could leverage to run their services?” “Where are the costs in running an enterprise data center, and how can we reduce them?” was another common question. I thought then, and still tend to believe, that the ultimate expression of that would be something like Heroku, or its open source clone/private enterprise variant: Pivotal Cloud Foundry. Couple that kind of base platform with various logging and other administrative capabilities supporting your services, and you remove a tremendous amount of cost from the space of managing a datacenter – at least when applications can move onto it, and therein lies a huge portion of the crux. Most classic enterprise services can’t move like that, and many may never.

In the past several years, I’ve come to think a lot more about small installations of autonomic services: the small local office with a local physical presence, running on bare metal, to be specific. In that kind of setting, something like Kubernetes or Marathon not in the large – crossing an entire datacenter – but in the small – focusing on a single service – becomes really compelling. Both of these go beyond the “setting up the structure” that Terraform does, and like a distributed init.d script or systemd unit, they actively monitor what they’ve started, at least on some level. Neither open source platform has really stitched everything together to the point of reacting seamlessly to service levels, but it’s getting pretty damned close. With these tools, you’re nearly at the point where you can have a single mechanism that creates a service, keeps it running, upgrades it when you have updates, can scale up (or down) to some extent, and can recover from some failures.
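The “keeps it running” piece boils down to a reconciliation loop: continuously compare the declared state to the observed state, and act on the difference. Here’s a minimal, hypothetical sketch of the idea – the function and instance names are stand-ins of my own, not any real scheduler’s API:

```python
# Minimal reconciliation-loop sketch: converge observed state on declared
# state. A real controller would watch the cluster and call a scheduler;
# here the "cluster" is just an in-memory list of instance names.

def reconcile(desired, running):
    """Return the actions needed to converge the running set on the goal."""
    if len(running) < desired:
        return ["start"] * (desired - len(running))
    if len(running) > desired:
        return ["stop"] * (len(running) - desired)
    return []  # already converged; nothing to do

# Declared: 3 replicas. Observed: one instance has died.
print(reconcile(3, ["web-1", "web-3"]))                    # → ['start']
# Observed: someone scaled up by hand, leaving one too many.
print(reconcile(3, ["web-1", "web-2", "web-3", "web-4"]))  # → ['stop']
```

Run that comparison forever, add rollout rules, health checks, and resource placement on top, and you have the skeleton of what Kubernetes and Marathon are doing – and a long way toward the autonomic goal.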

We’re slowly getting close to fully autonomic services.