Proprioception

Proprioception is one of those big, complex, compound Latin-y words that means “knowing where you’re at”. Hit the Wikipedia link for a much deeper definition. The reason I’m writing about it is that it’s something I think anyone writing software for a distributed system should know about.

This kind of capability may not be a requirement you’re tracking or even thinking about, but it should be. Load balancers (HAProxy, or the various hardware ones like F5) have used this concept for ages, and it is now getting traction in the various services that orchestrate containers. Kelsey Hightower gave a talk about the concept at Monitorama this year, and it is more than worth the 30 minutes of video. I would actually suggest you stop reading this and go watch it if you’re involved in writing services or building software. It is applicable well outside the space of back-end distributed systems, although his examples are all about building robust server-side software.

Then this morning in DevOps Weekly there was an article from New Relic on how to use the same kind of feature that Docker recently added: HEALTHCHECK. It’s Docker specific, with excellent examples of why adding this to your service matters and how you can use it. It’s not just containers, either – Consul has a similar mechanism built in for service discovery when you’re using VMs or physical machines as well.

It’s a concept that I think a lot of folks are using, but without much consistency, and it seems to be a good indicator of “do they get it” when it comes to devops and robustly running large infrastructure. We should go beyond that: from my perspective it should be an expected deliverable, tested and validated, for every component of a distributed system. It is an out-of-band channel for any application or service to say “I’m overloaded”, “I’m broken”, or “I’m doing fine” – if you’re writing one, add this in.
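
As a concrete sketch of what I mean (the port, path, and threshold here are my own invention, not from any of the articles above), here is roughly what a minimal out-of-band health endpoint can look like in Python, using only the standard library. A Docker HEALTHCHECK, an HAProxy check, a Consul check, or a Kubernetes probe could all poll something like this:

    import http.server
    import json
    import os

    # Hypothetical example: report "ok" or "overloaded" based on whatever
    # signals matter to your service. Here we fake it with the load average.
    OVERLOAD_THRESHOLD = 4.0  # arbitrary threshold for this sketch

    def current_status():
        load1, _, _ = os.getloadavg()  # Unix-only; stand-in for real signals
        if load1 > OVERLOAD_THRESHOLD:
            return 503, {"status": "overloaded", "load": load1}
        return 200, {"status": "ok", "load": load1}

    class HealthHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            code, body = current_status()
            payload = json.dumps(body).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    if __name__ == "__main__":
        # Serve the health endpoint on a side port, out of band from real traffic.
        http.server.HTTPServer(("0.0.0.0", 9000), HealthHandler).serve_forever()

The details of what “overloaded” or “broken” mean are entirely yours – the point is that the answer is cheap to compute, cheap to poll, and separate from the real work the service is doing.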

Eighty percent polishing

Like anyone, I reflect current situations and problems back to things I’ve done, other problems I’ve solved, or more generally what I’ve learned. After I graduated from college, one of the things I spent a lot of time learning about was metalworking. Specifically the art of raising, which is a technique for taking sheet metal (iron, copper, steel, silver, etc.) and forming it into three-dimensional objects by working it over various forms, compressing and thinning the sheet with hammer blows. You also have to periodically soften up the work by annealing it (so it doesn’t crack and sunder). I still keep the copper flask that I formed in my earliest classes – it’s sitting on a shelf of artwork and plants, a reminder to me of persistence and functional beauty.

In retrospect, one of the most valuable things I took from that class was that when you’re creating something, forming up the rough object is pretty quick. It is the “finish work” that takes forever. The goal was often a smooth, shiny surface (although I enjoyed the dimpled, hammer-beaten look myself) – and the reality is that after you’ve spent the time with the stakes and hammers, if you want that shiny surface, you’re going to have to work your butt off to get it. Cleaning, sanding, and polishing – another incremental process where it gets progressively better. That “finish work”, as I like to call it, takes a good 80% of the overall time in making the piece.

Advance 20 years – I still have a propane forge and shop anvil, a few stakes, and a lovely collection of hammers. And most of my creative time is now going into software projects.

I see the same pattern play out: getting the project up and working, just operational – which is an awesome feeling – is just the first 20%. If you want the work to be a finished piece, and not just something you say “heh, yeah – I could see how that might work” about, then you’re in for the remainder. Cleaning, sanding, and polishing in the software world is writing tests and fixing the operations so that they’re consistent and clear. It’s clearing the technical debt so that future advances won’t be stunted from the start, writing documentation so that others can use what you made, and writing notes to yourself or other developers so that you’ll have a clue what it was meant to do when you come back to it in a year. It might also be making and refining a user interface (CLI, GUI, or whatever) that works flawlessly, shares relevant information, and is accurate and beautiful in its own right.

The deeper takeaway worth noting is that the 80% spent polishing is what makes the object work for someone else. In the software world that’s often driven by user feedback, and you had better be ready to listen – because (as with art and craftsmanship) it may or may not match your viewpoint and world view. It is inherently subjective, which can make it a real struggle for more analytically minded people.

My recommendation is to take the time, get the feedback, and refine your work. In the end, I think we all want things that are both beautiful and functional, and it takes work and feedback to get there. And that is, to my mind, the essence of the craftsmanship of the work.

Distributed service platforms’ largest challenge: network interoperability

The biggest challenge to introducing technologies to an organization today is all about making them available to the end users. Underneath the layers of software, and even hardware, it’s the network – and how technologies plug into that network is a constant challenge. I doubt it’s a surprise to anyone deeply involved in this space and deploying things like OpenStack, Docker Swarms, or Hadoop/Storm clusters.

The dark reality is that network interoperability is really crappy. Seriously, stunningly bad. Even with standards, vendor interoperability is troublesome, and competing standards make it even worse. Just within the space of container orchestration and exposing containers across multiple/many physical servers, what you choose for networking is hugely impactful. Calico, Weave, Flannel, or plain bridging, possibly with OpenVSwitch, are all pushing for attention and adoption, with the goal of becoming a de facto standard. Individually these are all good solutions – each has its bonuses and drawbacks – but the boundary to something else, the interface, is ultimately the problem.

Implementing one of these against an existing environment is the trouble point. It’s not just container platforms that drive this – many modern distributed system solutions have the same issue: OpenStack, Hadoop, and Storm, in addition to the Docker, Marathon, and Kubernetes platform sets. Even pre-engineered solutions at the multiple-machine or rack scale struggle with this. There is no tried and true method, and due to the ongoing complexities of basic vendor network interoperability, it’s likely going to be a pain point for a number of years yet.

The most common patterns I’ve seen to date try to make a router or layer-3 switch the demarcation point for a service: define a boundary and (depending on the scale) a failure zone to make faults explicit, leveraging scale to reduce the impact. The critical difficulty comes with providing redundant and reliable access in the face of failures.

Configurations may include VLANs routed internally and static IP address schemes – maybe based on convention, or maybe just assigned. Alternatively (and quite popularly), internal services are set up with the expectation of their own private, controlled network, and the problem is reduced to the boundary points where egress and ingress are handled, which at least restricts the “problem space”.

When this is done with appliances, it is typically resolved with reference configurations and pre-set tweaks for the most common network vendors. Redundancy choices move with the vendor solutions to handle the specifics needed to validate full interoperability – from MLAG setups with redundant switches, to spanning tree (this one always gets hated on, with good reason), to more recent work like Calico, which interacts directly with routing protocols like BGP.

Watching this pain point for the last 5-6 years, I don’t see any clear and effectively interoperable solutions on the horizon. It is going to be a lot of heavy lifting to vet specific scenarios against specific vendor implementations. I’m hopeful that the effort going into container runtime networking will help drive better interoperability. If nothing else, the push of private clouds, and now containers, is driving a lot of attention to this problem space. Maybe the answer will lie in the evolving overlay networks, in BGP-based interop (like Calico), or in the equivalents in the IPv6 space, which only seem to be taking hold at the largest enterprises or in government setups.

Data Center API – self-service monitoring

When I was Director of Cloud and Automation Services at Disney, the group I reported into was a Disney-specific Internet Service Provider for the other business units (and there were a LOT of them). The rough goal of that position was “making it possible to run more things for business units – cheaper, faster, better”. It was that group’s first stab at providing self-service, cloud-like capabilities for the rest of the organization, and it ultimately led me onward to OpenStack, Nebula, and other interesting technical challenges.

One of the early challenges (and it is still a challenge today) was applying operational best practices to running services when doing it via “self-service”. The classic ops model (which I suspect is still there today, just mixed in with other options) used the “big guys” – HP OpenView, BMC, Tivoli, etc. What we wanted to do was enable a portion of what I was internally calling our “Data Center API” for the business unit developers. The goal was to make it possible for a developer to set up monitoring for their application without having to open a service desk ticket, which then led to some phone or email conversation nailing down the specifics of what should be monitored, what thresholds applied, or even reaching into the more complex areas beyond simple metrics and thresholds. In short: simplify it all down, and get the human process out of the way.

The licensing models for all of those products were (and still are) complete shit for self-service interactions and growth. They were also all configuration driven (not API driven), and some of them only allowed those configurations to be updated through a graphical user interface. Most of my vitriol was reserved for the ones that required Win32 applications to update their configurations. We ended up doing some development around a cheaper option with a site-licensed model so that we didn’t have the incremental cost-growth issues. In the end, the first cut was as simple as “we’ll ping test the VM when we set it up for you, and you can have us verify that one or more arbitrary HTTP URLs are active and send alerts (emails, in that case) if they aren’t”. It was all imperatively driven by the request to create the VM, and editable after the fact through a web interface exposed to the developer. Not bad, really – although I wanted to make it a lot more.
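
For flavor, the heart of that first cut was not much more complicated than the following sketch. The URLs, addresses, and helper names here are made up for illustration – the real thing hung off the VM provisioning request rather than a hard-coded list:

    import smtplib
    import urllib.request
    from email.message import EmailMessage

    # Hypothetical config: in the real system this came from the VM request.
    URLS_TO_CHECK = ["http://app.example.internal/health"]
    ALERT_ADDRESS = "dev-team@example.internal"

    def url_is_up(url, timeout=10):
        """Return True if the URL answers successfully within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except Exception:
            return False

    def send_alert(url):
        msg = EmailMessage()
        msg["Subject"] = "ALERT: %s is not responding" % url
        msg["From"] = "monitoring@example.internal"
        msg["To"] = ALERT_ADDRESS
        msg.set_content("Health check for %s failed." % url)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        # Run from cron (or any scheduler) every few minutes.
        for url in URLS_TO_CHECK:
            if not url_is_up(url):
                send_alert(url)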

Fast forward to today, some five years later, with microservice deployments exploding the scale, footprint, and diversity of monitoring – both metrics data and logs. PaaS offerings like Heroku’s add-on marketplace, and gaps in AWS features, gave services like Splunk, DataDog, New Relic, and DynaTrace a chance to exist, grow, and evolve into powerful tool sets. The open source/DIY side also exploded, with the ELK stack, Graphite, Nagios, InfluxData, and Zenoss all mixing open source and commercially supported projects. Most recently, tooling like Prometheus, Grafana, and InfluxDB, and James Turnbull’s book The Art of Monitoring (including Riemann), have been grabbing a lot of attention – very deservedly so.

What caught my eye (and incited this post) is Prometheus’s advances, now released as 1.0 (although it’s been pretty damn stable for a while). They followed the same keep-it-simple poll/scraping path that we did back at Disney, and have some blog posts on scaling in that configuration. I have personally come to favor directly event-driven reactions, to reduce the cascade of added latencies and race conditions that you get with polling, but for the trailing-response sort of monitoring mechanism, you can’t argue with their success and ability to scale it.

Most interestingly to me, they are implicitly encouraging a change in the scope of the problem. While Prometheus can scale to the “Missouri River in ’93” floods of data that monitoring an entire datacenter can deliver, they’ve defaulted to scoping the area it covers down to a single complex application/service made up of microservices. One per, thank you very much. Prometheus solves the setup and configuration problem by embedding into the Kubernetes service discovery mechanisms: as Kubernetes changes its pods, data collection keeps rolling in as individual services are added and removed. I suspect comparable orchestration technologies like Docker Swarm, Terraform, and Marathon have similar mechanisms, and AWS CloudFormation has included simple monitoring/alerting for ages within its basic API.

It still means another API to write to and work with – alert templates and such – but the client libraries that are available are the real win, covering the major languages you might use in even the most diverse polyglot development environments. It’s the second most important component in the orchestration game – just getting the service up and running being the first.
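
As a rough illustration of what instrumenting with one of those client libraries looks like (the metric names and port here are my own, not from any particular project), here is a minimal sketch using the official Python client, prometheus_client:

    import random
    import time

    # pip install prometheus_client
    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metrics for a made-up service.
    REQUESTS = Counter("myapp_requests_total", "Total requests handled")
    LATENCY = Histogram("myapp_request_seconds", "Time spent handling a request")

    @LATENCY.time()
    def handle_request():
        REQUESTS.inc()
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

    if __name__ == "__main__":
        # Expose /metrics on port 8000 for Prometheus to scrape.
        start_http_server(8000)
        while True:
            handle_request()

Point a locally running Prometheus at localhost:8000 (one scrape config entry, or service discovery in a cluster) and the metrics start flowing with no further wiring.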

This is taking the datacenter API concept and scaling it down so that it can run on a laptop for the developer. Prometheus, Kubernetes, Docker, and the rest can all be run locally, so you can instrument your application while you’re writing it and get real-time feedback on what it’s doing, and how, while you develop. Rather than just deploying your code, you can deploy your code next to an ELK stack configured for your services, and now with Prometheus monitoring it as well.

This is the sort of functionality that I expect a Platform as a Service (Heroku, CloudFoundry, OpenShift, etc.) to provide: clear APIs and functionality to capture metrics and logs from my running application, to the point that you could bind integration testing into these services. In my ideal world, these tools would capture and benchmark resource utilization during functional integration and regression tests, annotating the pass/fail markers and actually providing, you know, metrics against which you can optimize and base real development decisions (which are also business decisions if you’re doing it right).

The closer we can move the detail needed to run an application “in production” to the developer, the bigger the win – all the details of testing, validation, and operational performance need to be at hand, as well as a clear means of using them.

AWS expands to an IDE in the cloud

I missed the announcement a few weeks ago: Cloud9 got scooped up by Amazon in mid-July. I used Cloud9 on and off over the past two years, primarily as a free service. While it has its definite limits, it’s an amazing piece of technology. In the realm of “OMG, look what you can do in a browser now!”, a functional IDE is as far out there as FPS video games. The service uses all the advances in the browser/JavaScript engine and some impressive coordination with container-based technology to make an excellent IDE experience in your browser. Then they stepped beyond plain browser technology by providing the ability to have more than one person working on the same files in the IDE at the same time. Just like two people can work in a Google Doc together, you can collaborate on code in the same fashion.

In my experiments, I was highly focused on JavaScript and some Python – strong spots for Cloud9. As IDEs go it was decent, but not as full-powered as Visual Studio, Xcode, or Eclipse. More like a beefed-up text editor – like Atom – but really several steps beyond that. It’s a little hard to characterize.

The reason it caught my eye was twofold:

  • I could collaborate with another remote person on code in a mechanism like pair programming.
  • My development environment was no longer pinned to my laptop, and even better – was available on any device where I sat down.

I had really achieved the second bullet some time before, but by utilizing remote terminals, Linux shells and vim enhancements, and tooling to set up a VM or container the way I wanted. I maintained the equivalent of a Vagrantfile for creating an OpenStack instance with exactly my development environment for ages. The downside was the loss of the graphical environment – most of which I didn’t mourn too much, but there’s a benefit to an effective UI.

Cloud9 leveraged GitHub accounts (today’s de facto “developer identity”) to connect, and offered a fairly simple pricing scheme if you were putting a team on it. CodeEnvy appears to be doing something similar, but with the usual “enterprise spin” and greater integration into enterprise mechanisms. I personally enjoyed using Coda for iOS when I was seeing what options worked best for development on an iPad Pro with a keyboard. Coda, along with Prompt, gave me great access to any cloud resources, with the speed/responsiveness benefit of a truly local editor.

The IDE setup for C9 is still available – in fact, you can see my experimental workspace with the core libraries from the RackHD project, last updated about 8 months ago. The underlying systems support way more than JavaScript these days – Ruby, Python, PHP, C++ – really, anything you can develop on Linux. Their IDE language tooling was focused on JavaScript the last time I played with it (8+ months ago), with burgeoning support for other languages. I haven’t used it recently enough to assess what’s there today, but it’s likely even more.

A few of my coworkers have taken to setting up preconfigured containers for their development environment, leveraging Docker to make their development tooling consistent while still using local editors to manipulate files. They’re doing all their development in Go. There’s a tremendous number of positive things to say about Go, but how to set up and use its tooling across different projects and dependencies isn’t among them. In that scenario, a Docker-based tooling environment was a joy. I wish they’d ditch the Makefiles, though – I’ve always hated them, regardless of how effective they are.

The big question in my head is “What will Amazon do with Cloud9?” There is some supposition that Amazon used it to pull focus away from Google, but Cloud9 also had ties/integrations into Salesforce and Heroku. I hope it wasn’t a “we’re unsustainable, where the hell can we go to pay out our creditors” fire sale. Amazon has toyed for ages with how to best apply reasonable and effective UIs (and UX in general) over their services. They suck at it, to be honest. Cryptic and arcane – but if you know it and don’t mind a lot of confusion and cheat sheets, you can make it work. Not unlike Windows 3.1 back in the day.

Anyway, hopefully this marks a point of infusion of UI/UX sanity into AWS service efforts. They need it.

paint and sandboxes in 3D and VR

I had a singular joy last night: watching my love “paint” in 3D using the HTC Vive VR goggles and a couple of wands. Standing nearby meant being likely to get poked, but the expressions of joy and wonder she showed while reaching through the air were truly wonderful. Inspiring, actually.

With the work I do today, few folks will ever know about it, let alone pick it up and get such an intensely enjoyable moment from it. I’ve described myself as a “digital plumber” several times in the past, and I think that generally still applies: the art and science of building, running, and debugging distributed systems for a variety of purposes, mentoring teams in software engineering, bridging the gap between need/vision and the reality of what can be done, and sometimes coaching ancillary professions around the edges. I get tremendous satisfaction from writing the code I write, seeing it work, and knowing it’s helping people, albeit most of them other digital plumbers. That same helping people is why I’m involved with open source on several levels, have been in the past in a variety of communities, and I’m sure will be in the future in other communities. I play a team game, and revel in the “small groups getting impressive stuff done”.

The VR goggle experience last night stood out against that. Or more particularly, the expressions of joy and wonder while experiencing it, playing with it, did. While I’ve worked for media companies, some really big ones, I was never directly involved in anything leading toward the creative content, or even the tooling in support of it. I’ve got to admit, after last night’s session I am feeling the lure.

I’ve no idea what I’ll do with it. I’ve been looking forward to seeing more of VR and what’s available, tracking the news on hardware, software tooling and engines, demos that are attracting attention, and of course the big players in the tech – HTC, Oculus, Sony, and Microsoft. I think anything in this space today will be more like a movie or a AAA game in terms of the spend needed to really assemble a solid experience, but there are enough pieces and parts to start playing with it in a “desktop publishing” fashion – limitations galore, but still something that can be expressed and potentially shared. We’ll have to see what lies therein…

The evolving world of the server-side developer

Over the past couple of weeks, I’ve talked with a number of folks about Kubernetes, Marathon, Docker Compose, AWS CloudFormation, Terraform, and the bevy of innovation that’s been playing out over the past 12-18 months as Docker swims across the ecosystem. This is at the top of a lot of people’s minds – how to make applications easier to run, easier to manage, and how to do that consistently. I wrote a little about this whole set a few weeks ago, and I thought it would be worthwhile to put down my view of the changing world of a server-side software developer.

background

All of these technologies are focused on making it easier to run sets of applications (the in-vogue term being microservices) and keep them running. The significant change is containers evolving out of the world of virtual machines. The two are not mutually exclusive – you can easily run containers in VMs – but fundamentally containers are doing to virtual machines what virtual machines did to physical x86 servers: they’re changing up the management domain, encapsulating the specific thing you want to run, giving it a solid structure, and (mostly) making it easy.

AWS took virtual machines and accelerated the concept of dividing up physical hardware into smaller pieces. Spin up, tear down, API driven, measured and controlled the crap out of it – and made The Cloud. That change, that speed, has shifted our expectations of what it takes to get infrastructure in place so we can produce software. When we get down to basics, everything we’re doing is about running an application: doing what makes it easier, faster, cheaper, more secure, more reliable. Microsoft Azure, AWS, and Google Compute are leading the race to drive that even further towards commoditization. And if you’re asking “what the hell does that mean” – it means economically cheaper and, for all practical purposes, consistent. The cost of running an application is more measurable than it has ever been before, and it shouldn’t surprise anyone that businesses, let alone technologists, would encourage us to optimize against cost and speed.

For software delivery, virtual machines mostly followed the pattern of physical machines: deploy an OS, use it for patches and updates, and leverage OS packages as the means of deploying software. Lots of companies still dropped some compressed bundle of code for their own software deployment (a zip, war, jar, or whatever). A few bothered to learn the truly arcane craft of making OS-level packages, but a pretty minimal set of folks ever really integrated that into their build pipelines. The whole world of configuration management tools (CFEngine, Puppet, and Chef) erupted in this space to keep the explosive growth of “OS instances” – virtual servers – under control.

Now containers exist, and they have formalized a way of layering up “just what we need” and, to some extent, “exactly what we want”. Whether it’s a compiled binary or an interpreter and code, containers let you pull it all together, lock it into a consistent structure, and run it easily. On top of that speed, you’re removing a lot of cruft with containers – it’s one of the most brilliant benefits: trimming down to just what you need to run the application. In a world where we can now consistently and clearly get numbers for “what it costs to run … on AWS”, it’s an efficiency we can measure, and we do. Virtual machines were heading this way (slowly) with efforts like JeOS (just enough OS), but Docker and friends got into that footrace and left the JeOS efforts standing still.

In all this speed, having consistent ways to describe sets of processes working together, their connections and dependencies, and being able to set that all up quickly, repeatedly, and reliably is a powerful tool. In the VM world, that is what AWS CloudFormation does – or Terraform, or Juju, or even BOSH or SaltStack, if you squint at them. Those all work with VMs – the basic unit being a virtual machine, with an OS and processes getting installed. In the container world, the same stuff is happening with Kubernetes, Marathon, or Docker Compose. And in the container world, a developer can mostly do all of that on their laptop…

benefits

Docker scales down extremely well, and developers are reaping this benefit, since you can “run the world” on your laptop. Where a developer might once have needed a whole virtual infrastructure (e.g. an ESXi cluster, or a cloud account), the slimness of containers means we can pare away all the cruft and get down to just the applications – and that slimming is often sufficient to run everything on a laptop. The punchline “your datacenter on a laptop” isn’t entirely insane. With everything at the developer’s fingertips, overall development productivity increases, quality can be more easily verified, and so on. The win is all about the cycle time from writing code to seeing it work.

Another benefit is the specificity of the connections between processes. I can’t tell you the number of times in the VM or physical world where I was part of a review chain, trying to drag out what processes connected to what, which firewall holes needed to be punched, etc. We drew massive Visio diagrams, reviewed them, and spent a lot of hours bitching about having to update the things. Docker Compose, Kubernetes, etc. include all this information up front. Where we created those Visio diagrams, they’re now part of the code. Imagine a dynamic Visio-like diagram that shows a Docker Compose setup up and running, with the state of each process and the status/utilization of all the connections. Like the fire-control stations on a ship, the whole thing diagrammed – sections lighting up for clarity about what’s working and what isn’t. We are steps away from that concept.

challenges

Upgrading is now a whole different beast. Containers are pretty locked down, and due to their nature of overlaying filesystems, an update to an underlying layer generally means you need a whole new container deployed. The same path applies to a code update and to a security patch for your SSL libraries. How your tooling supports this is a big deal, and most of these systems have a pretty clear means of doing what’s typically called a blue/green deployment to solve this problem. It remains to be seen what we do for tracking security updates and CVEs that apply to underlying libraries, and how we patch and update our code as we find the issues. Up until a few months ago, there weren’t even initial solutions to help developers do this.

Containers also allow the environment to scale. Add more hardware, use more resources – the application not only scales down, it scales up. Apple, Facebook, Microsoft, Google, Netflix – and frankly a ton of companies you’re maybe not even aware of – all have applications that span many machines. With the connections between containers an integral part of the system, that scaling process is no longer unclear, and the world of scheduling and multi-container management is where some of the most interesting innovation is happening today. Applications can go from running on a laptop to scaling across multiple datacenters. The “how” this happens – what tools are used and how the logistical problems are solved – remains open. It’s a hell of a challenge, and there are probably challenges we haven’t even hit yet that we’ll find are critical to solving these problems.

the roads forward

These last two items – scaling and upgrading/updates – are the two places where the frameworks will distinguish themselves. How they help developers, what they do, what limits they place, and how well they work will drive the next generation of software development for “server side” developers. All of this is independent of the language you’re writing in. Java, Go, JavaScript, Python, Erlang, Ruby – it all applies the same.

There is a whole new set of problems that will emerge with distributed systems being the norm. Emergent behaviors and common patterns still have to be established and understood; there’s a whole world of “new to many” that we’ll be working in for quite a while. Like how crazy interesting Rule 30 is in cellular automata.

Will AWS Chalice be the straw that broke Heroku?

A friend of mine sent me a link about a “serverless python microframework” called Chalice. Chalice is an offering from AWS Labs, a public experiment of sorts. As I watched the YouTube video, I thought “damn, AWS finally stepped into the PaaS world”.

Using Chalice is incredibly straightforward. If you’re familiar with python development, the slight variation on Flask will be almost immediately understood. Chalice settles into niches within the plethora of AWS functionality to give a developer a command-line develop, deploy, and verify loop – all on the cloud stack, and without having to mess with much configuration: exactly the win that PaaS provides. It is AWS’ first serious step into the space where Google App Engine, Heroku, CloudFoundry, and OpenShift have been slowly building up and competing.
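
To give a sense of the Flask resemblance, a hello-world Chalice app is about this small (the app name here is my own, for illustration):

    # app.py – requires: pip install chalice
    from chalice import Chalice

    app = Chalice(app_name='helloworld')

    @app.route('/')
    def index():
        # Returned dicts are serialized to JSON behind the generated endpoint.
        return {'hello': 'world'}

From there, chalice deploy pushes the function up to AWS Lambda and wires up the API Gateway endpoint for you – which is where that “without having to mess with much configuration” feeling comes from.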

I look at Heroku (which runs on AWS) and how they developed their business model. Heroku will be sticky for a while, but I’ve got to imagine that if AWS wants to play that game and grab that segment of the hosted PaaS market, Heroku is going to be working double time to differentiate themselves and keep their market. Google has their infrastructure to back them up, and frankly was first into this game. CloudFoundry and OpenShift aren’t fighting for the hosted market, but for the enterprise marketplace. Heroku, however, is a bit more out on a limb.

Heroku could play the game of better user experience – without a doubt it already is. It could bolster that with feature capabilities as well – more integrated solutions to own that “easier to develop and debug” space. But there is a problem there: where they could have expanded internally, they have already built a marketplace. New Relic, DataDog, and a flurry of other hosted solutions have all refined themselves in that marketplace, providing impressive services that will be hard to replicate. That is going to make it a lot harder for Heroku to use more “built-in” features to differentiate.

I’m sure this is just the opening shot in the coming marketplace fight for hosted PaaS. Chalice is barely out of its diapers, but you can easily see where it can grow. The gang at Salesforce has a big fight on their hands…

How To: Debugging and Isolation

There are a number of skills that I think are critical to being a good engineer. Although my 20-year-old self would have scoffed at the idea, “clear, correct, and concise communication” is probably the most important one. As a team gets larger, this gets harder – for everyone. Sharing ideas and knowledge, and communicating across groups that do things differently, or just have a different focus, can be, and often is, a challenge.

One of the places I see this is at the boundary between folks who’ve written or created some code (and implicitly know how some or all of it works) and folks who are trying to use that same code – but without the knowledge. They’re learning, trying things out, getting confused, and often asking for help.

A conversation often solves that pain point, but someone needs to have the knowledge to have that conversation. Or maybe it’s not working the way you expected, and you’re now in the state where you need to figure out how someone else got a surprising result. You need a way to do this learning. That is where debugging and isolation come in – they are techniques for doing this learning.

Here’s the dirty little secret: almost nobody, developers and creators of the systems included, keeps perfect knowledge of how a system works in their head. Everyone uses this technique, even if they don’t call it that. Knowing it, knowing about it, and using it intentionally is a skill every engineer should have.

In my head, I call this “the game of figuring out how this thing works”. It’s a kind of game I’ve loved since I was a kid, which probably goes a long way toward explaining me. Truth is, we all do this as kids, although wrapping some names and a formal process around it isn’t something most folks do. Point is, you already know how to do this – it’s just a matter of practice to get better at applying it.

Playing this game is simple; here’s how it works:

  • describe the thing you’re trying to figure out
  • make a guess (create a hypothesis) about “if I do XXX, then YYY will happen”
  • try it and see!
    • if it didn’t work, make another guess
    • if it did work, add that to your knowledge
  • rinse and repeat, cook until done, etc

This first sequence is the portion called debugging. It works, but getting to a full description of everything you can do, and every reaction that could come out, can be a time-consuming affair. What you’re doing in your head is building up a model of how the thing works – in essence, you’re reverse engineering it to figure it out. And yeah, if you clued in that this is exactly the same thing as the “scientific method”, you’re on the right track.

Note that I called it “simple”, not “easy”. This particular game can be very frustrating, just as it can be incredibly rewarding. There are doctoral theses and research papers on ways to do this better, specific techniques to use in specific situations, and so on. In the software engineering world, there’s also tooling to help get you the information you’re after in an easily consumable form. But that’s not always there, and things like tooling aren’t distributed evenly. Recognize that this can be hard and difficult. It is worth calling that out, because it frequently seems like anything simple should also be easy.

In the debugging I described at the top of this article, you’re doing this as a precursor to figuring out how to make something work the way you expected it to work, instead of the way it is currently working. Or maybe you’re trying to reproduce what someone else saw but that doesn’t happen for you. Whatever the specific goal you are trying to achieve, the basic process is the same.

So what’s this isolation? Isolation is a way to make debugging easier and faster, and it makes it a bit easier to figure out problems in general. The whole idea behind isolation is that you can often take a problem and break it down into subproblems that can be solved independently. Or said another way, most “things” are made up of components that are assembled together, and you can often use this structure to reduce the scope of what you’re guessing about.
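
Here is a tiny, contrived sketch of one form of isolation in code (the failing function is made up purely for illustration): given a big input that triggers a bug, repeatedly split it in half and keep whichever half still fails, so you end up with a much smaller case to reason about.

    def process(records):
        """Stand-in for the real system; pretend this blows up on bad input."""
        for r in records:
            if r.get("value") is None:
                raise ValueError("bad record: %r" % r)

    def still_fails(records):
        try:
            process(records)
            return False
        except ValueError:
            return True

    def isolate(records):
        """Keep splitting the input, holding on to the half that still reproduces the failure."""
        while len(records) > 1:
            mid = len(records) // 2
            first, second = records[:mid], records[mid:]
            if still_fails(first):
                records = first
            elif still_fails(second):
                records = second
            else:
                break  # the failure needs both halves; stop shrinking
        return records

    if __name__ == "__main__":
        data = [{"value": i} for i in range(100)]
        data[63]["value"] = None    # the needle in the haystack
        print(isolate(data))        # -> [{'value': None}]

This is the same move git bisect makes across commits: the structure of the thing you’re debugging is what lets you throw away half the search space at each step.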

Most things we build are like this, and we tend to naturally think about things in hierarchies and layers. Sometimes there is the trick of even figuring out what something is made of (which is why we smash atoms together: to break them apart and figure out what they’re made of), but when you’re working on figuring out a software problem you have the benefit of some nice clean boundaries on that problem. Well, hopefully clean boundaries. Often it’s written down, or you can ask someone, or if it’s open source, you can read the code.

I go back and forth between isolation and debugging techniques. I start off making sure that I understand how something is put together by coming up with my tests and verifying them, and when something doesn’t work as I expected, I break it down to the smallest piece of the puzzle that isn’t working the way I expected.

Keep a piece of scrap paper around while you’re doing this. It’s my cheap version of a scientist’s lab notebook. I personally love butcher paper and a few colored pens, or a good whiteboard – but I mostly make do with a legal notepad and whatever pen happens to be at hand. Draw the diagrams of what you think the system breaks down into. Then draw how the pieces interact. Don’t try to keep it all in your head. Just the act of drawing it out and scribbling on a piece of paper can help organize your own mental model of what is happening and how things are connected.

There is a side bonus to scribbling down notes and diagrams as well: when it comes time to communicate with someone else, you have something you can share. I’m a very visual person – I like to describe things with diagrams, and sometimes a prodigious amount of hand waving – showing movement, focus of conversation, etc. In general, sharing a visual representation of what you’re talking about can be a great way of helping explain what you’ve learned to someone else.

There are plenty of formal ways to do this drawing (UML diagrams such as layer diagrams, sequence diagrams, etc.). While you are doing this for yourself, draw in a way that is meaningful to you. Writing the notes is for you first and foremost – if you want to share this information, you can clean it up after the fact. Bubbles with lines connecting them is one form I like to use – scrawling graphs if need be. Sometimes layers and boxes make more sense. And most of the time I’m doing this sort of thing, it’s the interactions between components – their sequences and what happens next – that I’m diagramming and trying to figure out (this is where different colored pens come in handy; plus, who doesn’t like lots of colors in the diagram?).

At the end of this process, you are probably trying to share this detail with someone else. I would encourage you to use the adage that was drilled into me by physics professors in college: show your work. There is nothing wrong with taking a photo of your notebook page where you figured out that something wasn’t working as you expected, and including it along with writing that describes what you found and what you’re after.

Most systems these days make it easy to include pictures as well as text – bug reports and pull requests in GitHub, for example. I have a set of employees who just absolutely adore putting this kind of thing in PowerPoint. Honestly, whatever works and can be shared with someone else is good. There is a whole set of skills around communicating clearly, but just getting the process started is more than half the battle.

Leadership skill: moderation

Coffee conversation threaded through a number of topics this morning, mostly flitting around diversity and how it is a continued challenge. I mean within technology fields in particular, but really it’s everywhere. A post by Anne Halsall on Medium that has stuck with me is When Women Engineer. If you haven’t seen it, take a moment and give it a solid read-through. It is an excellent post looking at the crossover of sexism and how most companies have a very male-oriented bias in how they interpret, well, everything.

When I first read Anne’s post, one of my first thoughts was “hey, some of what you’re seeing isn’t just male sexism, it’s corporate-standard stupidity in dealing with different personalities”. There is some of that in there, but Anne reiterated that it wasn’t just a failure to deal with different personalities.

One of the skills critical to leadership is being an effective moderator. The reason applies to some of what Anne saw, but it extends through different personalities and, as I’ve concretely learned over the past 18 months, it is an excellent tool for overcoming significant cultural differences. “Overcoming” here means communicating clearly – not forcing a change in the styles of interaction that come from different cultures.

Moderating is not just about making sure everyone obeys some basic politeness rules. My best metaphor right now is that it’s like working a complex sound engineering board where each of the people in the conversation is one of the inputs. Some are darned quiet, some are loud – and leveling those out as best as possible is an obvious part of the effort. But there’s also the communication that may come across as garbled. What may not be apparent as a need is being not only willing, but proactive, in asking clarifying questions or otherwise changing the flow of the conversation to get a full understanding out. The cues for this can be really darned subtle – a raised eyebrow, furrowed brows, or other more whole-body expressions of surprise – or the tone of the conversation a few minutes later.

A moderator has implicit (sometimes explicit) control of the conversation, and establishing that up front – often as simply as saying “I’ll be moderating this meeting” – is important. You can’t be an effective moderator without it, and there are definitely times when it’s forgotten. You may need to wait out a particularly long-winded or excessively vocal individual. I personally try not to engage in the “shout them down” tactic, but I’ll admit to having used it too. Honestly, when I’ve had to do that, I figured that I screwed up some better method – it’s just so… inelegant.

There is also a bit of a social contract that you are agreeing to when you’re moderating: that you won’t be injecting your own hidden agenda into the conversation. That is, you won’t be using the conversational control to your own “evil” ends. Hidden is the critical word here – all conversations have a purpose, so making it explicit what that purpose is up front – calling it out before any in-depth conversation has happened – is a good way of getting it into everyone’s heads. From there, it’s paying attention to the conversation flow and the individuals, and guiding it to try and achieve those ends. You may have to reiterate that purpose during the conversation – that’s OK. Plenty of good stuff is found in the eddies of the main flow – don’t stop it entirely, but have a sense of when it’s time to get back on track.

You might have read that good meetings all have agendas. I’m generally in agreement, and one of the formulas I try to use is starting off any conversation with the agenda and goals. That helps immediately set the stage for moderating the conversation: in doing so, you have implicitly asserted that you’re paying attention to that purpose and goal. In the “conversation is a river” metaphor, you stepped up to the rudder and said you’d pilot the boat.

This applies all over the place in discussions – from a scrum daily standup meeting (which is nicely formulaic, even though I tend to repeat what I’m looking for in the status update), to design brainstorming, to technical discussions of “the best way to thread that mobius needle”.

One of the characteristics a moderator has to have (or develop) is a willingness to engage in what may be perceived as conflict. That is, you have to be willing to step in, contradict, and stop or redirect a conversation. Growing up, I was a pretty conflict-averse person – so much so that I’d let people walk over me, and walk over a conversation. I had to really work on that: be willing to step into a conversation, to signal with body language or conversational gambits that we needed to stop and/or re-route the flow of conversation. And yes, you’ll hit those times when all of those mechanisms fail – when emotion has gotten so heated that someone is just raving on – and the only thing you can do is stop the conversation entirely. It may even come across as terrifically rude, but the best thing you can do is get up and step away from the conversation. Sometimes that means physically walking out of the room. Another choice is to let the individual who’s really going on just exhaust themselves, but recognize that the conversation may be best set aside for the time being, or the problem/agenda may need to be reframed entirely.

As a side note, engineers – often male engineers – can be notoriously obtuse about, or just outright ignorant of, body language cues in conversation. Most of the time I think it is non-malicious, but there will be people who intentionally ignore body or even verbal cues in order to continue their point or topic, or to ignore or override others trying to get involved.

A skill I have personally focused on developing (and which I recommend) is the ability to take notes while moderating a discussion. It may sound like “how on earth would you have TIME to take notes as well as moderate?!” The answer: make the time. I am willing to ask for a pause in the conversation while I catch up on notes, and writing down the perspectives and summaries of what people are saying helps me externalize the content of what has been said. It actually makes it clearer to me, as the very process forces me to summarize and replay what I thought I heard. I’ve found more instances of garbled communication by writing it down: when I heard it I thought I had internalized it, but when I tried to write it down I realized that it wasn’t making sense. And then there is the side benefit of having written notes of the meeting, which I recommend saving – not everyone will remember the conversation, and in some cases they may not have been a part of it.

To reiterate, moderating is a skill that I view as critical to good leadership. If you’re leading a team, formally or informally, think about how you can apply it. Think about it, and DO it. It’s one of the ways a leader can “get roadblocks out of the way”, and if you’re aspiring to lead teams, it’s something you’d do well to invest your time in learning.