Vapor 3 and a few random experiments

This past week I dug more deeply into server-side Swift, specifically Vapor 3. Vapor was interesting because it was recently rebuilt on top of SwiftNIO, and initial reports of its performance were very positive. A highly performant HTTP application framework in a memory-safe language? Worth a look!

I have used dynamically typed languages (NodeJS/TypeScript/JavaScript and Python) for quite a while, so the biggest shock was moving back to the constraints of a strictly typed language. This cascades into how the software is represented at a lot of levels, and the return to “classes, structs, and enums” was the hardest part to re-acquaint myself with. The piece that feels the weakest (compared to other languages and frameworks) is testing: the dynamic languages use the full capabilities of the language’s dynamism, and that is sorely missing from Swift. I have become immensely spoiled by testing frameworks like Jasmine, or Mocha and Chai with supertest over Express, which make for super-easy-to-read tests that work the code directly.

Speaking of BDD, I took a day’s detour into trying Quick and Nimble, but in the end decided it was more pain than value, and that leveraging XCTest, even if writing tests with it felt stunted and somewhat awkward, was a more robust path. Quick and Nimble were particularly painful to work with in server-side Swift and seem far more robust with iOS projects, but the lack of reflection, and the way XCTest has to be told what to run on Linux, is atrocious. When I found the SwiftPM command to help collect the tests for XCTest:

swift test --generate-linuxmain

that won the day, and it was back to XCTest.

Vapor 3 itself was very straightforward, although the docs are very rough – and in some cases downright useless. There are multiple points where extensions to Vapor (WebSocket, Auth, and so forth) are not clear on how you attach and use them in their templates and sample code. Fortunately the community (on Discord, rather than Slack and Vapor on StackOverflow) makes up for the difference. The developers who are actively pushing Vapor forward as well as community members are very accessible and willing to answer questions.

As I mentioned earlier, I’m finding the idioms of programming with Swift the hardest to get my head around. It is a very different way of thinking about the problem and how to solve it, tending to be fully specified at every level. Structs, extensions, and enums make up most of the structures, often spread across lots of smaller files in the examples that I’ve been seeing so far.

While it’s very straightforward to read and understand, I find myself struggling to know where to look things up, and how to read documentation to get what I need out of it. In addition, even Apple’s documentation seems significantly weaker than it has in years past. There’s a new style to the documentation that I’m struggling to learn: reading the docs to work out which enumeration options should be used, how, and when, is definitely a challenge. It is often Q&A and samples on Stack Overflow that provide the closing hints on how to use the code in any holistic way that makes a difference.

On the good side, Xcode running Vapor on my laptop was a gem, and I was immediately enthralled with the cpu and memory tracking details that you could see while running the code locally. I haven’t fully explored what you can do, but even just seeing the live CPU and memory tracking on the Vapor application while it’s running is wonderful. In other environments, there would be a lot of infrastructure setup to capture that same level of immediate detail – and it’s just built in with Xcode.

CPU spikes when running “ab” load testing
memory usage over time with the same “ab” load testing

Vapor makes it easy to leverage Xcode, wrapping the SwiftPM tool commands so you can invoke something akin to:

vapor xcode -y

This will regenerate an Xcode project file and open it. Vapor projects also have a number of examples of wrapping the code into a container to run however you like, and the next version of Vapor (4, in development now) will have some “polite shutdown” signal handlers for SIGINT and SIGTERM, which will make it work better with orchestration systems like Kubernetes.

I have this perverse idea of wanting to run this same code that I can put into a container on an iOS device for a quick-shot “mobile server”. Yes, I know there are issues with iOS and activating the relevant devices through SwiftNIO, but the idea of having my own portable server as an iOS app is really appealing.

Vapor 3 is all based on Swift Package Manager, for which there is not yet direct Xcode support. It looks like it may be possible to use Xcode’s cross-project linking to have an SPM-based Xcode project working with a more classic iOS one that uses the project as a dependency. There’s an article on how this can work called Bringing Swift NIO to the iPhone, and a similar reference, a walk-through how-to in the Swift forums. I haven’t wrapped my sample Vapor 3 project into an iOS application yet, but I’ll be giving that a shot shortly.

Adding tracing with Jaeger to an express application

I have been following distributed tracing technologies (Zipkin, OpenTracing, Jaeger, and others) for several years without deeply trialing any of them. Just prior to the holidays, we were having a number of those “why is this slow?” questions about an express application, written in TypeScript, providing an API endpoint. The API fronts multiple different data sources (MongoDB, InfluxDB, and Redis), and we run it all in containers, deployed and orchestrated with Kubernetes.

I decided that with some quieter time with the holiday, this was the perfect time to experiment with and take a stab at seriously implementing some form of tracing and external presentation to shed some light on what was happening within the API endpoint, so we could have some good conversations about the scope and structure of what’s being returned in the various implemented API endpoints.

I wrote about Jaeger (and OpenTracing) in Kubernetes for Developers, which provided me with some passing familiarity. Building on that prior research, I chose to go that route again to see what we could learn. A new complication to the scenario was that we wrote our API server using TypeScript. While the concepts are well documented, there aren’t any detailed tutorials or examples showing how to enable tracing in your TypeScript/express application. My goal with this article is to provide one of those detailed examples that may help to provide a roadmap (or highlight potholes) for others who are interested in doing the same.

The resources that I used to source and start the learning included the book I wrote (Kubernetes for Developers), and also:

  • – more marketing than detail, but provided a starting point for finding other resources and hinting at where to look
  • – the source repository for the NodeJS jaeger client implementation, primarily the README (outside of when I needed to look at how something was implemented) for developer level notes about the implementation details.
  • – the real source for a lot of the starting points, with concrete implementation examples, even if they didn’t always follow the same patterns I needed. Yuri Shkuro did some lovely work in providing these examples. I didn’t hassle him while learning or using any of this, so any misinterpretation of what was expected is mine.

Getting a toe hold and seeing a trace

The first step in my journey was to get tracing enabled and see any single result. There are a lot more examples for Go than for Express/Node.js. Adding in TypeScript and getting the libraries to work seamlessly within the type-constrained interfaces wasn’t entirely obvious.

The client itself installs cleanly with npm or yarn, as does opentracing, so getting the libraries was as simple as:

npm install opentracing jaeger-client --save

The Jaeger client for Node was written using Flow, which is a different approach to adding type constraints on top of JavaScript, and doesn’t translate directly to TypeScript. Fortunately, it implements OpenTracing, and OpenTracing offers a TypeScript type definition library, so it was possible to load that and use OpenTracing’s definitions with the Jaeger client’s implementation.

In the end, I got this operational in TypeScript with:

import * as opentracing from 'opentracing';

From there, you can set up the overall tracing client to work with your code. The tutorials inspired some of this, but the gist of the setup is to define an initialization function that passes in the tracer configuration, including how traces are sampled and reported. For getting a toe-hold and just starting, I used:

const initJaegerTracer = require('jaeger-client').initTracer;

function initTracer(serviceName: string) {
  const config = {
    serviceName: serviceName,
    sampler: {
      type: 'const',
      param: 1,
    },
    reporter: {
      logSpans: false, // this logs whenever we send a span
    },
  };
  const options = {
    logger: {
      info: function logInfo(msg: string) {
        console.log('INFO ', msg);
      },
      error: function logError(msg: string) {
        console.log('ERROR ', msg);
      },
    },
  };
  return initJaegerTracer(config, options);
}

export const tracer = initTracer('my-service-api') as opentracing.Tracer;

I chose to use the ‘const’ sampler (implying “trace everything”), and while I had logSpans set to true to see how it was working, I switched it to false as soon as I saw it was operational, as it was generating a huge amount of additional console logging output that was cluttering up my output.

logger in this function looks like it was built so that you could pass in any of a variety of logging functions (an example of which might be Winston). For getting started, I just stuck with some simple console.log statements.
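
As a rough sketch of that idea: the logger option only needs an object with info and error functions, so any "write a line" function can be adapted to that shape. The names makeTracerLogger, LogFn, and TracerLoggerLike below are invented for illustration and are not part of jaeger-client.

```typescript
// Hypothetical helper (not part of jaeger-client): adapts any
// "write a line" function into the { info, error } shape that the
// options.logger object uses.
type LogFn = (line: string) => void;

interface TracerLoggerLike {
  info(msg: string): void;
  error(msg: string): void;
}

function makeTracerLogger(write: LogFn): TracerLoggerLike {
  return {
    info: (msg: string) => write('INFO ' + msg),
    error: (msg: string) => write('ERROR ' + msg),
  };
}

// Usage: swap console.log for another logger's function without
// touching the rest of the tracer setup.
const consoleLogger = makeTracerLogger((line) => console.log(line));
consoleLogger.info('tracer initialized');
```

The resulting object could then be passed as options.logger, assuming the other logging library exposes a function that accepts a single string.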

Once I had the tracer set up, I could use it to create spans, annotate them, finish them, and let the jaeger-client library take care of getting them out of my code and to their external destination. This also implies there’s somewhere to capture and display this information, and the Jaeger getting started documentation came in useful here.

Jaeger offers an all-in-one image, with an in-memory store and a UI for collecting and visualizing traces, that’s perfect for running in a Docker container right on your development machine.

I used the example directly, exposing all the ports (even though I only use a few) and it fired right up:

docker run -d --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one

Something to note (maybe obvious, but I’ll mention it): to kill off this background container, you use the command:

docker kill jaeger

and before you try and run it again, you may want to “clean it up” and remove it. You can do that with the command:

docker rm jaeger

Once it’s up and running, you can access port 16686 on localhost (open http://localhost:16686) to see the jaeger UI (it is boring until you get traces into it)

With an endpoint ready to receive, you can start to create spans and send them in. Yuri Shkuro’s tutorial code (lesson 2) really cemented this piece for me, and it was relatively light work to translate it for use with TypeScript.

I started off by picking one express route that we’d already implemented and enabling it with a trace. Setting up a trace was pretty straightforward, and the pattern quickly emerged for how I’d repeat this, so it moved into a function to simplify the code in the route.

Creating a span is primarily calling tracer.startSpan. When you’re doing that, you can create the current span as stand-alone, or you can build it up with references to other spans. This can either be another span reference directly, or you can extract the span context information from somewhere else and use it. For express routes, there are some handy helper utilities to look for and pull a tracing context out of the HTTP request headers, so the logic I resolved upon here was to try and pull that context, and if that does not exist, then go ahead and create a stand-alone span.
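
To make “tracing context in the HTTP request headers” a little more concrete, here is a sketch of what that context looks like on the wire. The Jaeger clients propagate it in an uber-trace-id header formatted as {trace-id}:{span-id}:{parent-span-id}:{flags}. You would not normally parse this by hand, since tracer.extract does it for you; parseUberTraceId and ParsedTraceContext are hypothetical names used only for this illustration.

```typescript
// Illustration only: tracer.extract() handles this for you.
// ParsedTraceContext and parseUberTraceId are hypothetical names.
interface ParsedTraceContext {
  traceId: string;
  spanId: string;
  parentSpanId: string;
  flags: number;
}

function parseUberTraceId(
  headers: { [name: string]: string | undefined }
): ParsedTraceContext | null {
  const raw = headers['uber-trace-id'];
  if (!raw) {
    return null; // no incoming context: start a stand-alone span
  }
  const parts = raw.split(':');
  if (parts.length !== 4) {
    return null; // malformed header: treat it as absent
  }
  const flags = parseInt(parts[3], 16); // flags are hex digits
  if (Number.isNaN(flags)) {
    return null;
  }
  return {
    traceId: parts[0],
    spanId: parts[1],
    parentSpanId: parts[2],
    flags: flags,
  };
}
```

A null result here corresponds to the “no parent context” branch in the route logic: nothing upstream started a trace, so the span is created stand-alone.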

I’m still experimenting with what and how much to annotate onto spans, and so far I’ve chosen to use the function names from the routes as the span names, and to annotate the controller for a bit of grouping.

export function createControllerSpan(controller: string, operation: string, headers: any) {
  let traceSpan: opentracing.Span;
  // NOTE: OpenTracing type definitions at
  // <>
  const parentSpanContext = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, headers);
  if (parentSpanContext) {
    // continue the trace that the caller started
    traceSpan = tracer.startSpan(operation, {
      childOf: parentSpanContext,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  } else {
    // no incoming context, so create a stand-alone span
    traceSpan = tracer.startSpan(operation, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  }
  return traceSpan;
}

With the span created, the other critical piece is to report back when the span should be marked as completed. The function finish() on a span can be invoked directly, but it quickly became apparent that it was handy to also annotate the return result code and if it should be considered an error. This ends up being more handy in the UI than anything else, letting you filter down to “interesting” traces.

I ended up also encapsulating that into a function:

export function finishSpanWithResult(span: opentracing.Span, status: number, errorTag?: boolean) {
  span.setTag(opentracing.Tags.HTTP_STATUS_CODE, status);
  if (errorTag) {
    span.setTag(opentracing.Tags.ERROR, true);
  }
  span.finish();
}

I’m not particularly thrilled with the errorTag?: boolean optional parameter on the tail end of this function; I’m sure there are more aesthetic ways to handle it. I’m still prone to think of function signatures in terms of the named arguments that are common in other languages (Python), so take my specific implementation with a grain of salt.
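
One possible alternative, sketched here rather than something we actually adopted, is to pass an options object so the call site reads like named arguments. TaggableSpan is a minimal stand-in for the part of opentracing.Span that is used, so the sketch runs without the library; finishSpanWith and SpanResult are hypothetical names. The tag keys are the string values behind opentracing.Tags.HTTP_STATUS_CODE and opentracing.Tags.ERROR.

```typescript
// Minimal stand-in for the part of opentracing.Span used here.
interface TaggableSpan {
  setTag(key: string, value: unknown): void;
  finish(): void;
}

// Hypothetical options object: reads like named arguments when called.
interface SpanResult {
  status: number;
  error?: boolean;
}

function finishSpanWith(span: TaggableSpan, result: SpanResult): void {
  span.setTag('http.status_code', result.status);
  if (result.error) {
    span.setTag('error', true);
  }
  span.finish();
}
```

A call then looks like finishSpanWith(traceSpan, { status: 500, error: true }), which makes it harder to misread what the trailing boolean means.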

Between these two functions, I have the pieces to create and finish spans, which results in code within an express route that looks something akin to:

import { createControllerSpan, finishSpanWithResult } from '../helpers/tracing';
import * as opentracing from 'opentracing';

// ...

// within the express route function:
const traceSpan = createControllerSpan('myController', 'doSomethingCool', args.headers);
// if something is not found, finish the span with a 404 and return early:
finishSpanWithResult(traceSpan, 404);
return res.status(404).send();
try {
  // do work here
  // resulting in a resultList object that we'll return:
  finishSpanWithResult(traceSpan, 200);
  return res.status(200).json(resultList);
} catch (error) {
  finishSpanWithResult(traceSpan, 500, true);
  console.log('error while listing things ', error);
  return res.status(500).send();
}

With this in place, and the code’s previously existing tests (which leverage supertest to work the API endpoint), just running the unit tests was generating traces, which turned out to be immensely useful.

I’m fairly confident that the logic that I put into my routes could have been encapsulated within an express middleware, and we may take that path in the future. We are also using Swagger 2.0 as a definition layer, and the swagger-express client tooling sets up each of the operationId as middleware functions, so it was not as convenient to try that with our particular setup.
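
For anyone without that constraint, a middleware version might look roughly like the sketch below. It is untested against our swagger setup. RequestLike and ResponseLike are minimal stand-ins for the express request and response objects so the sketch runs on its own, and tracingMiddleware and its startSpan parameter are hypothetical names.

```typescript
// Minimal stand-ins so the sketch runs without express or opentracing.
interface FinishableSpan {
  setTag(key: string, value: unknown): void;
  finish(): void;
}
interface RequestLike {
  method: string;
  path: string;
}
interface ResponseLike {
  statusCode: number;
  on(event: 'finish', listener: () => void): void;
}
type NextFn = () => void;

// Hypothetical middleware factory: startSpan is whatever creates a span
// (for example, a thin wrapper around createControllerSpan).
function tracingMiddleware(startSpan: (name: string) => FinishableSpan) {
  return (req: RequestLike, res: ResponseLike, next: NextFn): void => {
    const span = startSpan(req.method + ' ' + req.path);
    // Node emits 'finish' on the response once it has been sent.
    res.on('finish', () => {
      span.setTag('http.status_code', res.statusCode);
      if (res.statusCode >= 500) {
        span.setTag('error', true);
      }
      span.finish();
    });
    next();
  };
}
```

The tradeoff is that a middleware only knows the HTTP method and path, so the span names would be less descriptive than the per-route function names used above unless the route name is passed along some other way.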

Adding this code to every route was frankly pretty tedious, but I also took advantage of the time to normalize each of the routes to a more consistent pattern. While it was a bit time consuming, it wasn’t hard, and the end result was structurally more consistent routes across all the controllers.

This level of implementation gets you a view of “how long” an API call took, but not the detail within it of why it took so long. A necessary first step, it was valuable in itself but the overhead of Jaeger and OpenTracing to get just this information was akin to swatting a fly with a flamethrower. The real value comes from extending the tracing so you can see what within the API call took time, and how long it took.

Datagram size problems

Just this level of tracing was enough to highlight a quirk that’s worth calling out. While implementing and running the tests to make sure I didn’t break anything, I started seeing a repeated error message in the unit test console output, starting with EMSGSIZE. After some digging, it turned out that this error was caused by macOS’s default maximum datagram size being MUCH smaller than the Jaeger client libraries expected.

It boils down to how the Jaeger client chose to transmit data and keep the process of collecting and transmitting this additional data as light as possible. The client collects the traces and then sends them in the background over UDP. Each send is a “fire and forget” packet called a datagram, and the maximum size of a datagram is set by your operating system. The default on Linux (and most Unix OSes) is 65,536 bytes, but the built-in default on macOS is 9216 bytes. The error that was appearing was the Jaeger client reporting that it was unable to send the traces, because it expected to be able to use the larger datagram size.
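
To illustrate why the limit matters, here is a small sketch of batching encoded spans under a maximum datagram size. This is not how jaeger-client is implemented; batchBySize is a hypothetical function that only shows the arithmetic. A sender that sizes its batches for the larger limit will produce packets that a 9216-byte limit rejects.

```typescript
// Hypothetical sketch: group encoded span sizes (in bytes) into batches
// that each fit within a maximum datagram size.
function batchBySize(spanSizes: number[], maxDatagramBytes: number): number[][] {
  const batches: number[][] = [];
  let current: number[] = [];
  let currentBytes = 0;
  for (const size of spanSizes) {
    if (size > maxDatagramBytes) {
      throw new Error('span of ' + size + ' bytes cannot fit in one datagram');
    }
    if (currentBytes + size > maxDatagramBytes) {
      // the next span would overflow this datagram, so start a new one
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(size);
    currentBytes += size;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

With three 4000-byte spans, a 65,536-byte limit allows a single datagram, while a 9216-byte limit needs two.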

The solution was fairly straightforward – tell the local OS to allow those datagrams to be quite a bit larger (match what Linux was doing). On MacOS you can view the size it’s currently set to with the command:

sysctl net.inet.udp.maxdgram

and you can set it with:

sudo sysctl net.inet.udp.maxdgram=65536

Once I did that, the error messages disappeared, and I saw significantly more traces hitting Jaeger that I could dig through.

But what is it doing?

At this point, we had nice visualization of the API call timing, but I wanted to leverage the wins of tracing with adding sub-spans to break up the work that was happening. Our API code works with multiple external data resources, so that was the natural place I wanted to wrap with tracing.

We are using Mongoose and its schemas, as well as the Influx node client libraries, to access these remote resources. I’m sure with more work I could have made the wrapping less intrusive, but with a few additional helper methods I was able to quickly get 90% of the value and a view into “what’s taking how much time”.

First I made a method to initiate a span within a mongoose Schema method (we use the virtual methods within Mongoose to consolidate our data models in the code):

export function createSchemaSpan(schemaName: string, operation: string, parentSpan?: opentracing.Span) {
  if (parentSpan) {
    return tracer.startSpan(operation, {
      childOf: parentSpan,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: schemaName
      }
    });
  } else {
    return tracer.startSpan(operation, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: schemaName
      }
    });
  }
}

The logic follows the same pattern as the express route span creation, but instead of taking headers apart looking for a span context, it’s set up to accept a parent span, which is much more readily available at the level of the mongoose virtual method function call. I then added an optional parentTraceSpan parameter to each of those functions (which are typically called from the express routes), and used that with this function to set up the span for any work that happened, and continue the cascade down.

The other place that was getting some MongoDB time was doing document queries. This got a little awkward with TypeScript, mostly because I did not feel super confident in how generics work and in parameterizing functions like this, so I took an uglier shortcut and did some down-casting of the types, and re-cast them on result, so that I could write a wrapper that consistently traced the call.

The end result is this function:

export async function traceMongoQuery(parentSpan: opentracing.Span, traceName: string, documentQuery: DocumentQuery<any, any>) {
  const traceSpan = tracer.startSpan(traceName, {
    childOf: parentSpan,
    tags: {
      [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
      [opentracing.Tags.COMPONENT]: 'mongodb'
    }
  });
  const documentResult = await documentQuery;
  traceSpan.finish();
  return documentResult;
}

which ends up getting used in a pattern like:

const possibleSite = await traceMongoQuery(traceSpan, 'Site.findOne', Site.findOne({ name: siteId })) as ISiteModel;

It means that when I trace the query, I need to know what should be coming back out so I can cast it back into the right object model, which is a little stupid, but kept everything working. Certainly not my proudest moment, but it’s functional.
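
For what it’s worth, a generic version could avoid the re-cast. The sketch below assumes the query object is a thenable from which TypeScript can infer the result type, which I have not verified against the Mongoose typings we use. QuerySpan is a minimal stand-in for opentracing.Span, and traceAwaitable and its startSpan parameter are hypothetical names.

```typescript
// Minimal stand-in for the part of opentracing.Span used here.
interface QuerySpan {
  finish(): void;
}

// Hypothetical generic wrapper: T is inferred from the awaited value, so
// the caller gets back the same type it passed in, with no re-cast.
async function traceAwaitable<T>(
  startSpan: (name: string) => QuerySpan,
  traceName: string,
  work: PromiseLike<T>
): Promise<T> {
  const span = startSpan(traceName);
  try {
    return await work;
  } finally {
    span.finish(); // close the span even if the query rejects
  }
}
```

Under that assumption, a call such as traceAwaitable(startSpan, 'Site.findOne', Site.findOne({ name: siteId })) would be typed by whatever the query itself resolves to.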

The other data source we use is Influx, and a separate method went into wrapping those queries, which had a less heavily typed setup than what we had with MongoDB and mongoose in TypeScript.

The function I used to wrap the influx queries:

// return type for an influx query is Promise<influx.IResults<{}>>
export async function tracedInfluxQuery(
  query: string,
  options?: influx.IQueryOptions,
  parentSpan?: opentracing.Span,
  spanName: string = 'influx-query'
): Promise<influx.IResults<{}>> {
  let traceSpan: opentracing.Span;
  if (parentSpan) {
    traceSpan = tracer.startSpan(spanName, {
      childOf: parentSpan,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        'db.type': 'influxdb',
        'db.statement': query
      }
    });
  } else {
    traceSpan = tracer.startSpan(spanName, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        'db.type': 'influxdb',
        'db.statement': query
      }
    });
  }
  const client = influxClient();
  const result = await client.query(query, options);
  traceSpan.finish();
  return result;
}

You can see this allowed me to even embed the specific query into the span. It’s a lot of metadata baggage to add to a span, and I think it’ll be useful, but other than being cool to see, it hasn’t paid off yet. As we promote this experiment into active systems with real usage, the tradeoff of consuming more space and bandwidth against the value will hopefully become a bit more clear. I might be able to get just as much value from a shorter tag, or a more meaningful grouping of the span names, instead of the explicit query itself.
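
One middle ground, sketched here as an idea rather than something we have tried, is to cap the length of the statement before tagging it. truncateStatement and the 200-character default are hypothetical choices.

```typescript
// Hypothetical helper: cap the length of a query string before it is
// attached to a span as the 'db.statement' tag.
function truncateStatement(query: string, maxLength: number = 200): string {
  if (query.length <= maxLength) {
    return query;
  }
  return query.slice(0, maxLength) + '...';
}
```

The tag would then be set with 'db.statement': truncateStatement(query), keeping the start of the query visible in the UI while bounding how much each span carries.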

Much of the time consuming lifting we’re doing is with time series data stored in InfluxDB, so while I wanted to see the mongoDB included to provide the context, I expected more detail to be useful when sub-spans were doing something with influx – and in the end, that was a pretty good assumption.

There were a few places in the code where I manually created a sub-span and wrapped some logic in the code with it. We use async/await throughout our code to keep it easier to understand and reason about, so I was tempted to try and wrap the promise and make that a nice boundary for a span, but in the end I found that just a few strategic additions around heavier code in the routes or elsewhere did what I wanted, and was a bit simpler.

Bringing it all together and seeing results

In the end, I have nearly all functions that are called from the API endpoints accepting a span as an optional parameter, and all the functions were updated to create spans regardless of a parent, using the logic above. With the tests so actively working the code at various levels, just running our local tests generated a bunch of traces, and I found quite a bit of value in seeing the progress I was making by running the test suite and then viewing all the traces reported into Jaeger on my desktop.

Some of the heavier lifting in the code is post-processing data we get back from influx – some of which can be quite extensive. When we wrapped some of the “what appears to be simple functional code”, it quickly highlighted how expensive getting that data (with our chosen way or algorithm) was – and it highlighted a few O(n^2) methods that were a lot more obvious in hindsight than in reading the code.

I also found that it was relatively easy to forget to close off a span, and when you do that, what happens is that you simply don’t see that span reported in Jaeger. So I quickly found myself focusing on code to make sure the sub-spans I expected to be there showed up as a means of knowing if I’d wrapped the logic correctly.
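
A small test-time aid could make that check less manual. The sketch below is an idea rather than something in our code base: it wraps spans as they are created and remembers which ones were never finished. SpanLedger and OpenableSpan are hypothetical names, and the wrapper only forwards finish(), so it would need extending to pass through the rest of the span API.

```typescript
// Minimal stand-in for the part of opentracing.Span used here.
interface OpenableSpan {
  finish(): void;
}

// Hypothetical test helper: remembers which spans were started but
// never finished.
class SpanLedger {
  private open = new Map<number, string>();
  private nextId = 0;

  // Wrap a span so that finishing it removes it from the ledger.
  track(name: string, span: OpenableSpan): OpenableSpan {
    const id = this.nextId++;
    this.open.set(id, name);
    return {
      finish: () => {
        this.open.delete(id);
        span.finish();
      },
    };
  }

  // Names of spans that are still open.
  unfinished(): string[] {
    return Array.from(this.open.values());
  }
}
```

At the end of a test, asserting that unfinished() is empty would flag a forgotten finish() directly, instead of noticing a missing span in the Jaeger UI.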

In the end, getting to this point ended up consuming probably 16 to 20 hours of time, and was mostly repetitious work that you could easily see progress on while you were doing it. And even at this point, with just the local test suite driving effort, there was a clear and valuable pay-off that starts to highlight “why some things are taking a while”.

Deploying to staging

The next step was to see it operating with some closer-than-test-suite real data. The infrastructure for a storage system for traces was more than I wanted to tackle just to start getting data back. Jaeger does a good job of setting it up, but there are a lot of moving parts for a “downtime holiday project”.

We run all our code in a Kubernetes cluster, and since I had an ephemeral memory-based container already available, I made a service of it with Kubernetes. JaegerTracing has some Kubernetes templates for those inclined to explore, and they helped me understand what will be needed in the future for a more solid, use-it-constantly implementation of a service.

I ended up forking what they had available and using it to make my own “ephemeral jaeger” using just in-memory storage.

# Copyright 2017-2018 The Jaeger Authors
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
apiVersion: v1
kind: List
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    name: jaeger-deployment
    labels:
      app: jaeger
      jaeger-infra: jaeger-deployment
  spec:
    replicas: 1
    strategy:
      type: Recreate
    template:
      metadata:
        labels:
          app: jaeger
          jaeger-infra: jaeger-pod
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "16686"
      spec:
        containers:
        - env:
          - name: COLLECTOR_ZIPKIN_HTTP_PORT
            value: "9411"
          image: jaegertracing/all-in-one
          # all-in-one Dockerfile: <>
          # CMD ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json"]
          args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]
          # all-in-one image is an in-memory image, with no default limit on how much memory will be consumed capturing
          # traces, so this puts a little bit of a limit on it.
          name: jaeger
          ports:
          - containerPort: 5775
            protocol: UDP
          - containerPort: 6831
            protocol: UDP
          - containerPort: 6832
            protocol: UDP
          - containerPort: 5778
            protocol: TCP
          - containerPort: 16686
            protocol: TCP
          - containerPort: 9411
            protocol: TCP
          readinessProbe:
            httpGet:
              path: "/"
              port: 14269
            initialDelaySeconds: 5
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-query
    labels:
      app: jaeger
      jaeger-infra: jaeger-service
  spec:
    ports:
    - name: query-http
      port: 80
      protocol: TCP
      targetPort: 16686
    selector:
      jaeger-infra: jaeger-pod
    type: NodePort
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-collector
    labels:
      app: jaeger
      jaeger-infra: collector-service
  spec:
    ports:
    - name: jaeger-collector-tchannel
      port: 14267
      protocol: TCP
      targetPort: 14267
    - name: jaeger-collector-http
      port: 14268
      protocol: TCP
      targetPort: 14268
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    selector:
      jaeger-infra: jaeger-pod
    type: ClusterIP
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-agent
    labels:
      app: jaeger
      jaeger-infra: agent-service
  spec:
    ports:
    - name: agent-zipkin-thrift
      port: 5775
      protocol: UDP
      targetPort: 5775
    - name: agent-compact
      port: 6831
      protocol: UDP
      targetPort: 6831
    - name: agent-binary
      port: 6832
      protocol: UDP
      targetPort: 6832
    - name: agent-configs
      port: 5778
      protocol: TCP
      targetPort: 5778
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod
- apiVersion: v1
  kind: Service
  metadata:
    name: zipkin
    labels:
      app: jaeger
      jaeger-infra: zipkin-service
  spec:
    ports:
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod

Most of this detail is stock from their examples, and offers a bunch of functionality that I’m not even touching (alternate transports and offering drop-in replacement functionality to accept Zipkin traces).

Buried in this detail is an option that I found on the all-in-one container to limit the number of traces that would be stored in memory, and since I wasn’t putting many other limits on this system, I wanted to put in at least some constraining factor. You can find it in args in the pod spec:

args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]

20,000 traces may end up being way too small, but for an ephemeral service implementation, I thought it made a reasonable starting point.

The other trick needed to make this all work in Kubernetes was adding something that can run in the same Pod as the API code, listen for the UDP data being sent, and forward it into the ephemeral service I had just set up. This part I had in notes from my work on Kubernetes for Developers, so I picked it back up and used it again almost verbatim. The end result is specifying two containers in the template that we used for the API code (this pattern is called a “sidecar”), so the additions end up looking like:

{{- if .Values.tracing.enabled }}
- name: jaeger-agent
  image: "{{ .Values.tracing.image }}:{{ .Values.tracing.tag }}"
  ports:
  - containerPort: 5775
    protocol: UDP
  - containerPort: 5778
  - containerPort: 6831
    protocol: UDP
  - containerPort: 6832
    protocol: UDP
  command:
  - "/go/bin/agent-linux"
  - ""
{{- end }}

We use Helm for the deployment into a cluster, and this is a snippet from the deployment template in our helm chart. The default values we’re using for the agent are:

  • image: “jaegertracing/jaeger-agent”
  • tag: “1.8”

With this in place, we deployed the tracing basics into our staging environment where we have more realistic integrations and data flowing, the end result of which is a tremendously more valuable solution. We access the Jaeger console with a port-forward right now, and this provided us with a way to quickly open a window into what was taking time behind the API calls and start to get to some really good conversations with this data as a base.

Where next?

This whole holiday experiment has been a great success so far, but there’s more work to do to use it seriously. Rolling the ephemeral tracing infrastructure into our production environment and switching from constant/trace everything to a probabilistic sampler is likely one of the first tasks.
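
As a sketch of what that switch could look like in the tracer configuration: the sampler block keeps the same shape, with type 'probabilistic' and a param between 0 and 1 for the fraction of traces to keep. samplerFor, the environment names, and the 0.1 default are hypothetical choices for illustration, not settings we have validated.

```typescript
// Shape of the sampler block in the tracer config shown earlier.
interface SamplerConfig {
  type: 'const' | 'probabilistic';
  param: number;
}

// Hypothetical helper: trace everything outside production, and only a
// fraction of requests in production.
function samplerFor(environment: string, sampleRate: number = 0.1): SamplerConfig {
  if (environment === 'production') {
    if (sampleRate < 0 || sampleRate > 1) {
      throw new Error('sampleRate must be between 0 and 1');
    }
    return { type: 'probabilistic', param: sampleRate };
  }
  return { type: 'const', param: 1 };
}
```

The result would replace the hard-coded sampler in the config passed to the tracer initialization, for example sampler: samplerFor('production', 0.05).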

The API in question is the back-end not directly for our customers, but for an Angular web application in a browser. How we’ve structured those calls from Angular, when we’re doing them, and how much we try to do in each call is where I expect to get some immediate value, as well as being able to answer the questions behind “why is it taking so long for this to render?”

I have this crazy idea about viewing portions of the single page application (what gets displayed) as a sort of costed effort. Now that we have “time taken” and the visibility of where it goes, we can look at what we load, when we load it, and “how much it costs” from a time and user responsiveness point of view. So maybe we could annotate a wireframe with “how expensive” each data visualization piece is, even if it’s manually done on a whiteboard, using Jaeger data about the “time-expense” of the API calls needed as the source for that information.

It’s also quite possible that I’m a little too “budget concept” happy, and the majority of value will be in developers, designers, and product managers in our company just being able to have a concrete way to see and address the desire to make really outstanding web-based user interfaces.

When we have it loaded into our production environment, then getting familiar with Jaeger filtering and that UI may also support us identifying problems quite a bit more effectively, and I’m actively looking forward to seeing this bolster the data we’re already capturing with our production monitoring with Prometheus.

And finally, I’m also interested to see where Grafana 6.0 goes in the future. It provides such lovely overlays for metrics capture, is building now into logging, and is perhaps linking that to tracing in the future.

I’d love to be able to add what’s happening in the Angular application and cascade the tracing from there. I don’t know of the way (outside of a lot more service infrastructure, and getting a LOT more familiar with jaeger-client and making a variant to work from a browser) to capture and forward traces. I’m honestly not sure I’d always want it in place, where I rather do want it always there and on at the API layer, at least while it’s not overly negatively impacting the service calls. There might be something interesting we can do for a single browser with a special setup that would allow us to profile from the user interface all the way back to the services, so there’s another interesting thought experiment for another down-time.

Underneath your code

Underneath your code is the working title for a new writing project. I started coming up with the idea for this a couple of months ago based on what I’ve been learning and hearing. Some of this comes as a follow-up from Kubernetes for Developers, and some of this is just aiming to help new developers.

A lesson that has been reinforced for me recently is that I can too easily assume a breadth of knowledge that doesn’t exist. I have been working with a couple of new developers recently, and there have been a number of concepts they weren’t yet familiar with that caught me by surprise. As I went looking for public resources to help them learn and dig into the topics they were interested in learning more deeply, what I found was a lack of cohesive and consistent information. All the detail is out there, even freely available, but scattered in a lot of different places. It is hard to pull together, and even harder to make the connections of why some of these details are interesting or relevant.

I’ve recently also seen some pushback when it comes to Kubernetes and Developers. The general meme as I’m perceiving it is “don’t expose Kubernetes to developers, they don’t need to know or it’s just too confusing”. Ahmet has great points about it being confusing or perhaps not being the best or correct abstraction over operational needs, but I think backing away from making Kubernetes more accessible and useful to developers is the worst possible path. Outside of my own opinion that using Kubernetes is a good and reasonably efficient way to help keep software running reliably and consistently,  I think developers should absolutely know the world in which their code lives, and the details of how that runs, or not. Lots of general operational concepts are just unknown, confusing, or perceived as irrelevant or that should be hidden. I think it’s a lot better to be able to have a holistic view and knowingly choose what you hide and abstract away, and even more so what constraints come with those choices.

What I am starting is the next layer down from what I wanted to create with Kubernetes for Developers. Instead of writing about an operational tool for developers, I am aiming at writing about the underlying concepts, systems, and common mechanisms. My goal is to make that information accessible and understandable for new developers. My first (terrible) working title was “Ops for Devs” as a lousy inverse of “DevOps”.

I thought about writing a book in the classic fashion: editor, publisher, etc – but only for a very brief period. I want to take this in a different direction. I want to publish and promote this work to a broad set of new developers. To enable that goal I want to make the content solidly open and freely accessible. At the same time, I also want to get professional editing help (they are such a godsend for writers), so I am aiming to collect all the pieces together as a PDF, ePub, and mobi and make it available for sale as a means of funding an editor, and perhaps some of the other niceties such as graphics and illustration.

The work from Julia Evans and her amazing and accessible Zines has been an inspiration. The comic style isn’t what I feel comfortable creating, but I’m constantly referring people to specific zines as a good intro/starting point for a variety of topics, and as a fun seed that can lead them to digging deeper and more broadly.

Tooling is a bit of an open question. I have written in word processors (Scrivener, MS word, google docs), using markup (reStructured Text, Markdown), and use a variety of publishing/rendering tools (Sphinx, ReadTheDocs, and more recently Hugo). Something that can be stored in plain text and usefully read from there is important to me, so I’ll probably ditch the proprietary word processors and see what I can do with content sourced in GitHub, GitLab, or Bitbucket.

Just recently I started looking at GitBook and LeanPub. I heard some great things about LeanPub recently, but my own experiments haven’t been very successful.

My lack of success with LeanPub may come from a hard constraint: their service appears to require that 2-factor auth not be enabled for GitHub, which just scuttles me, as the Kubernetes organization very reasonably requires 2-factor auth to be enabled. It’s not entirely clear if this is the problem or not though, because when I render sample content I get just a vague “something broke, go check what you changed” message with no details about the failure for me to diagnose.

With the what and where still up in the air, I’ll probably take a little time to look at AsciiDoc and AsciiDoctor for the generation. To get started, I fell back to a tool I’ve loved in the past: Scrivener, and took an initial stab at doing the markup for it in Sphinx and hosting it on ReadTheDocs.

However I end up playing it, an editor (or editors) and technical reviewers will definitely fit in. I was disappointed with my “editor experience” using the last publisher I worked with, but I have met so many great editors out there that I think it will be quite possible to find one (or a couple) to work with and pay them directly. Even though I had to back away due to time constraints a few months ago, working closely with the Kubernetes Docs team was a great experience, and really cemented the idea that I could find editorial help outside of a larger publisher.

Having done technical writing on and off for a couple of decades now, I’m very familiar (and slowly getting comfortable) with how fundamentally ephemeral it really is. The content from Kubernetes for Developers may have an 18+ month lifespan in terms of being useful, and when I started the project I expected that to be as short as 12 months for a useful lifetime. Looking at the work six months after publication, and reviewing how the Kubernetes project has evolved – it looks like I managed to nail enough core content that at least half of it will be viable for quite a bit longer.

One of my struggles while writing was the amount of time and effort it took to do good writing for the relatively limited lifespan of the content. Technology changes fast; products and platforms develop and evolve in weeks and months. They change, grow, and yeah – die off as well. I view it as an interesting challenge, and my current thinking is to work on shorter, more directed topics rather than larger, more expansive, volumes of work. I also want to keep a published reference material clear on its biases – what’s experimental and strongly opinionated I prefer to keep or see in a blog rather than what I think of as a cohesive publication.

The space for more exploratory, short, and opinionated efforts is still critical. Where the options are evolving quickly (for example, service meshes and ingress options in Kubernetes) I think it may be better to get at least some “how to” information available, over making it fully integrated and cohesive.

There is no denying the success and value of sites like StackOverflow, which leave me simultaneously grateful and annoyed. The content on the site is a beautiful example of caveat emptor and the need for critical thinking in reviewing and accepting the answers. It is also an amazing communal resource for information, answers, how-to, and examples. By its very nature it lacks a cohesive voice, style, or guide to what is available – and it is a great resource as it stands. What it does well, and what I find lacking from it, are influencing what I want to write, as well as how I might organize it.

The loose outline that I’m starting with for Underneath Your Code:

  • Where your code runs – what a “process” is
    • Running code in a browser vs. running directly on an OS
  • Operating system basics
    • Processes, file systems, memory, and networks
    • A little deeper on networks: DNS, IP addresses, ports & sockets
    • IP v4 and IP v6, TCP & UDP, DHCP, ZeroConf/Avahi
    • WTF is a 7 layer OSI model anyway, and why do I care?
  • Physical & virtual devices and IO
    • Serial ports, block and character devices, what is POSIX
    • USB and Bluetooth
    • Specialized hardware (GPUs, Accelerators, etc)
  • Practicals for working with processes
    • shell scripts and some Unix CLI concepts
    • commands, pipes, STDOUT, STDERR, STDIN
    • environment variables, and shell tests
  • Some basics about how operating systems work
    • Init, systems, hierarchy of processes and how they coordinate
    • Kernel vs. user space and permissions, and privileges
    • Memory, Buffers, and the various dials that can be tuned (queueing theory)
    • Sandboxing, background tasks, and scheduling
  • Another layer of indirection
    • Containers and VMs
    • Shared vs emulated resources
    • Cloud resources to IoT: you have a budget: CPU, Mem, IO
    • Smaller and smaller bits of computation – microprocessors and embedded devices
  • Physical stuff breaks, all the time
    • Redundancy and consistency
    • Networks, latency, and information theory
    • Why and how these fundamentals expose constraints for development work

Almost all of the topics in the outline could be (and in many cases are) books in their own right. While aiming to make an overview accessible and understandable, I am not covering every corner and case. This may be more sensibly organized as several works. I’m sure I’ll re-organize it several times when I get into the writing and trying to keep a more-or-less consistent narrative.

If you have feedback or thoughts on what would be useful, I’m all ears. You can poke me on social media or leave a comment here.


Review of using Helm to package and host applications

The open source project Helm represents itself as a package manager for Kubernetes. It does that pretty darn well, having been attached to the project from the earliest days of Kubernetes, and continues to evolve alongside, and now a bit more separately from, the Kubernetes project.

Looking at Helm version 2, it provides three main values:

  • As a project, it coordinates and reinforces a number of “best practices” for describing and running software on Kubernetes by housing a public collection of some of the most common/popular open source projects, packaged and “ready to use” within a Kubernetes cluster.
  • It provides a templating solution to generate the plethora of YAML files that make up the descriptions and definitions of Kubernetes resources.
  • It provides a “single command” tool for deploying one or more projects to an existing Kubernetes cluster.

The first two values have been the most meaningful to me, although with some definite caveats and gotchas. The third I thought would be more valuable. I still use the single-deploy-command regularly, although I’m questioning if it is a crutch that will ultimately be a trouble point.

More in depth for each of these:

Package Repository

The single most powerful aspect of Helm isn’t the code itself, but the result of the community and contributions those community members have wrought. The charts collection is the de facto set of examples of how to organize, structure the inputs, and run software with the resource concepts of Kubernetes.

While it calls itself a package manager, there’s a gap if what you are expecting from it is a single binary package that is installable – the moral equivalent of an .rpm, .deb, .apk, or installer exe file. You don’t get a single file – instead you get a set of configuration values alongside a set of templates. The configuration values are the defaults used with the templates to generate the description of the Kubernetes resources. These default values can also be overridden when you invoke helm, which is a godsend for scripted deployment (such as CI/CD). The resource descriptions generated from the templates expect (and require) the actual software you’re running to be available via one of the public container registries, with quay and DockerHub among the most commonly referenced. The software you’ll actually be running – the container images – is (intentionally) not included within the helm chart.
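To make the “defaults plus overrides” idea concrete, here is a minimal sketch in Python of layering caller-supplied values over a chart’s defaults. It is an illustration of the concept only, not Helm’s implementation (Helm is written in Go), and the value names are hypothetical. With Helm itself the overrides would be supplied with the `--set` or `-f` options.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides layered on top, recursing into nested maps."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are maps: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the default.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults, in the spirit of a values.yaml file.
defaults = {
    "replicaCount": 1,
    "image": {"repository": "example/app", "tag": "1.0.0"},
}
# Hypothetical overrides that a CI/CD job might supply.
overrides = {"image": {"tag": "1.0.1"}, "replicaCount": 3}

values = merge_values(defaults, overrides)
print(values)
```

Note that only the image tag is replaced; the repository default survives because the merge recurses into nested maps rather than swapping out the whole `image` block.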

If you want to run your Kubernetes cluster where it can’t take advantage of the public internet and those resources, be aware and prepared. I would not be surprised to see caching proxies of those services develop, much like Artifactory and Nexus developed for the Maven build tooling. In fact, I would keep a close eye on Harbor (technically open source, but dominated by VMware) to see what might develop to help them deploy in more isolated scenarios. It is not all that difficult to use private and local container repositories, just be aware the public packages expect to use the public container repositories.

Pervasively embedded within the templates is a fairly robust and opinionated set of ideas about how to take advantage of Kubernetes. The content of the templates contains a ton of knowledge, but be aware it is not always consistently applied or exposed. Like many projects it has learned from experience, so newer and more recently updated charts will often reflect different views of what is important and useful for deployment. Nonetheless, they provide a huge number of functional and effective patterns for running software.

These patterns are the strongest where the features have existed and been stable within Kubernetes for a while – Pods, ReplicaSets, and the effective use of the sidecar pattern for buckling on ancillary functionality or concerns. They are weaker (or perhaps viewed differently: various levels and consistencies of workarounds were created) for some of the newer features in Kubernetes, including StatefulSets and even Deployments.

In some cases, early workarounds were so heavily embedded that they persisted long after the need existed: the concept of Helm “deploying and managing state” was a filler to the gap of Kubernetes not having a solid Deployment concept earlier, and the whole world of custom resources and extending Kubernetes with operators overlaps with what Helm enabled with hooks. My perception is that both Kubernetes and Helm charts are struggling with how to best deploy these newer structures, which themselves often represent some operational knowledge or intended patterns.

Like their virtual machine focused brethren (Chef and Puppet) before them, the Helm community added testing and validation for their charts. The chart validation has expanded significantly in the past 18 months. Like any validation, it does not guarantee 100% effectiveness. Even so, I believe it is important to be willing to review what the chart is doing, and how it’s doing it, before using it. The instances of charts failing with newer versions of Kubernetes have decreased significantly, primarily due to the focus of the Helm community on recognizing and working to expose it as a problem and resolve it when it occurs.


Templating

A bit of background – Kubernetes resources can be defined in either JSON or YAML, and are a declarative structure: a desired state of what you want to exist. These structures are loosely coupled, “tied together” with the concept of labels and selectors. This is both a blessing and a curse, providing a lot of flexibility, but if you typo or mismatch some of those connections, there can be very little – to no – checking and it can be quite difficult to debug.
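The loose coupling is easy to see in a small sketch. The function below is a simplified stand-in for equality-based selector matching, written for illustration rather than taken from Kubernetes, and the label values are hypothetical. A selector with a typo is still perfectly valid; it just quietly matches nothing.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Equality-based matching: every selector pair must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical labels on a pod, and selectors as a Service might declare them.
pod_labels = {"app": "flask", "tier": "backend"}
good_selector = {"app": "flask"}
typo_selector = {"app": "flsk"}

print(selector_matches(good_selector, pod_labels))  # True
print(selector_matches(typo_selector, pod_labels))  # False
```

Nothing in the second case is an error as far as the system is concerned, which is why a mismatch tends to show up as “my service has no endpoints” rather than as a validation failure.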

In creating these resource manifests, you will often find yourself repeating the same information, sometimes many times – or explicitly using repetition to tie pieces together. This repetition and boilerplate overhead is ripe for the solution that developed around it: templating.

Helm uses (and exposes) Go templates along with the Sprig library of template functions, to greater and lesser degrees. From using the templating language, my opinion is that it is no better (or worse) than any other templating system. It has many of the same concepts that you might find in other templating systems, so if you are already familiar with a templating language, picking up the one used by Helm may be awkward but really is not too bad.

There are variants in other projects that enable a similar functionality to Helm (KSonnet, and the now mostly ice-boxed Fabric8). Even with competition, the network effects from Helm’s collection of charts make it very hard to compete. Most solutions in this space have to make a choice of how much of a language to build vs. how simple the templates are to use – a continuum between a fully fledged programming language and simple, targeted replacement of values. Helm’s choice adds in some language structures (concepts of making and using a variable, and transforming values), but holds back from the slippery slope into a fully custom language.
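For a feel of the “simple, targeted replacement of values” end of that continuum, here is a minimal sketch using Python’s standard library `string.Template`. This is not Helm’s template syntax, and the manifest is a hypothetical, heavily trimmed Deployment; it only shows how a single value removes the repetition that ties the pieces together.

```python
from string import Template

# The name is repeated in the metadata, in the selector, and in the pod
# labels, which is exactly the sort of repetition templating removes.
MANIFEST = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  selector:
    matchLabels:
      app: $name
  template:
    metadata:
      labels:
        app: $name
    spec:
      containers:
        - name: $name
          image: $image
""")


def render(values: dict) -> str:
    """Fill in the placeholders; raises KeyError if a value is missing."""
    return MANIFEST.substitute(values)


print(render({"name": "flask", "replicas": 2, "image": "example/flask:1.0.1"}))
```

There are no conditionals, loops, or value transformations here, which is where a pure replacement approach runs out of road and where Helm’s added language structures come in.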

We will see if that holds true with Helm version 3’s development, which will be adding the language Lua into the mix, although it appears more for handling the deployment hooks aspect.

If you are a NodeJS, Ruby, or Python developer and looking at the charts, you may have more confusion around what the resource you’re trying to create should look like rather than any trouble with the templating language itself. The templating does nothing to encapsulate or simplify Kubernetes resources and how they may be used together (or not). Helm itself has two commands that have been lifesavers in learning the templating and using charts:

helm template


helm install --debug --dry-run

These two commands run the templating engine and dump (with slight variances in what they’re expecting) the rendered results. You still end up seeing (and having to deal with) the “wall of YAML” that is Kubernetes, but these two commands at least make it easy to see the results of the templates after they render.


The exciting (yes, I get excited about weird things) capability of Helm to deploy my applications in a single command reinforced the concept of it being a package manager, but may ultimately be the biggest crutch of the solution.

As mentioned earlier, Helm “grew up” with Kubernetes, and was alongside the project from its earliest days, covering the gaps from the core of Kubernetes to the cold hard reality of “getting your stuff running” in a cluster. Helm’s concept of Tiller may be the earliest seed of an operator, as it installs itself into your cluster and then handles the deployment and management of resources that it manages. This same capability is more recently codified into custom resources and the operator pattern, as well as the simplest and most common use cases being covered by the Deployment resource and the associated controller.

When Kubernetes finally included RBAC as a default, Helm (and how Tiller was installed) illuminated a bit of a hole in how many people were using and deploying software. There was a lot of work exposing and thinking about how to properly secure Helm. Helm 3 will be removing Tiller from the concept of Helm, continuing to evolve with Kubernetes features.

You also don’t strictly need to use this capability of Helm, although it is darned alluring. As mentioned in the section on templating, you can render charts (and their templates) locally and use tools such as kubectl to apply the resulting resources to a cluster.

Having a single command that is easy to add into a script has been a godsend for continuous deployment scenarios. It is what powers GitLab’s “AutoDevOps” hosted continuous deployment. I use the deploy-with-a-single-command myself, and plan to continue to do so, but it comes with a price.

Helm likes to own the whole lifecycle of the release and does not expect or accommodate anything else interacting with the stuff it is managing. In many cases, this is completely fine, but in some cases it can be a pain in the butt. The most common scenario usually involves some manner of persistence – where you want to install and get the software running, and need it operational to do further configuration on how you want to use it. This could be anything from linking multiple instances of some software into a cluster, doing a database schema migration, or doing a restore of prior backups.

Helm 2 has the concept of hooks to help with actions that happen repeatedly and consistently with every deployment or upgrade process. Helm 3 will be expanding on these concepts, although I don’t yet know the specifics, with Lua as the scripting language to enable this functionality and potentially more.

I am personally conflicted on the inclusion of Lua and what it implies for Helm. While Lua is a lovely scripting language, and likely the best choice for what the developers decided they wanted to do, I think it may end up being a new barrier to adoption for developers outside of the Helm charts/Kubernetes space. Every developer that sits down to use Kubernetes comes with their own biases and comfort with scripting languages. They are often used to Python, Ruby, Javascript, or any of a number of other languages. If Lua becomes an implicit requirement for them to use Helm to accommodate their own operational needs, I suspect that barrier will be significant. Because of this I am hesitant to be excited about the inclusion and focus on using Lua with Helm. What it will ultimately mean in terms of developer accessibility to using Helm and Kubernetes together is yet to be seen. I hope it won’t be an even larger and steeper learning curve.

For the scenario where you want to do periodic, but not consistent, interactions – such as backing up a database or doing a partial restore or recovery – you need to be very aware of the application and its components in their lifecycles. In these scenarios, I have not found a terrific way of using Helm and its hooks to help solve those problems.

Kubernetes itself is only partially effective in this space, having the concept of jobs for one-off batch style mechanics. However, jobs can be darned awkward to use for things like a backup. While I used jobs and continue to try and make them work, I often revert to using kubectl to directly interact with temporary pods to get the work done consistently.

With Helm, I struggled with creating job resources that utilize the same ConfigMap, Secrets, and variables that are used with the charts. Helm is crappy at doing this if you’re using the deploy-with-a-single-command mechanism. And as I mentioned earlier, Jobs can be an awkward fit with the use cases I am trying to accommodate. These scenarios are more “pets” and “one-off” needs where knowledge of the underlying systems and their current state are critical. It may be that operators will ultimately win out for these use cases, but they have a fair way to go yet.

At its heart, this deployment capability that I implicitly use many times a day also strikes me as the current edition of Helm’s weakest point, and I wonder if it is a crutch that I will ultimately need to replace.

Lessons from Kubernetes for Developers

It’s been a little over six months since Kubernetes for Developers hit the streets. It has been interesting to see the uptake, and where the holes have been from what I first envisioned.

The metrics I receive on book sales are unfortunately wan – nothing so effective as a daily metric for either sales or general engagement. Well, nothing that is shared with me. I was tracking the site with a service that has since gone defunct and shut down, so about the only useful metric I can see is the Amazon page rank information on the book’s Amazon page.

What I learned more organically is that while some developers appreciated the book, a surprising number didn’t have the same base level of knowledge that I thought they might. Some of the feedback I’ve received included questions about DNS, how to use Linux command line tools, and general confusion about ports and IP addresses. Most of these questions came from people new to development, folks who “heard about Kubernetes” and wanted to know if or how they could take advantage of it.

The technical content for the book is getting low but consistent traffic at GitHub. The Python demo application gets a bit less traffic than the NodeJS demo application. There have been no real questions or queries through GitHub, but I think unless you’re reading the book itself, you wouldn’t be aware of the GitHub project.

One area that I wish I had delved into (and known more about) before I finished the book was ingress. In hindsight, though, this area is one of the perennial sore spots in Kubernetes – a beta feature for the past 9 releases with no consolidation progress, but with some actually interesting uptake at the edges of the Kubernetes project itself, with related projects blurring lines into service meshes, or being semi-solid commercial implementations over software or physical load balancers.

If I had written anything in depth, it probably would have been best to cover the stock Nginx ingress controller, as that seems to be about the most default. The advances from some of the open source projects since I published, such as Heptio’s Contour, have been pretty amazing and interesting.

Another area that’s been a surprising win has been cert-manager, a wonderful tool for publicly hosted Kubernetes clusters to help deploy and manage signed TLS certificates through LetsEncrypt. Since I didn’t get anything useful written about ingress or this lovely gem, I am working on an open source documentation contribution including a quick-start for cert-manager that will hopefully be done and live shortly.

There is also just a huge learning curve for developers to adopt and engage with Kubernetes. The concepts aren’t impossible, but there are a lot of them and the topics are complex and intertwined. Even with the complexity, gaps, and edge cases that bite I am still a huge fan of the project. I see it as the best choice for “software that helps you keep your software up and running” when you can take advantage of it.

Kubernetes for Developers is published!

It’s been a quiet few months on the blog, as all of my writing attentions have been focused on the book project I started back in September of last year. It is now published and available!


If you are so inclined, you can find a copy at, or on Amazon.

When I started this project, one of the things I really wanted to do was work with some editors to improve my writing. And while they didn’t work with me on this book, I learned a tremendous amount from Jennifer Rondeau and Steve Perry, who are technical writers (and editors) at Heptio and Google respectively, both maintainers of the Kubernetes open source project documentation.

As a side effect of this project, I got involved with the Kubernetes Docs team, helping out here and there and becoming a maintainer myself in the process. It’s a wonderfully diverse (and growing) team of collaborators from a bunch of different companies and backgrounds, and as a whole they have been incredibly welcoming and helped bolster my knowledge to be able to write the book.

I’m writing a book: Kubernetes for Developers

Quite a number of years ago, I did piece work and later published a book through MacMillan publishing. Being technical books, they are long gone from the shelves as time made the content irrelevant, although Gus sometimes likes to poke me saying he found a copy of my book in a used book bin. Strange to think that was 11 years ago.

While spending this summer sabbatical on a lot of traveling, I reflected on what I enjoy. Even while away from computers, I kept my involvement in the Kubernetes project. I find the project extremely compelling, primarily because it works from first principles and builds from there.

What drove me to do the writing is a “You don’t know it until you can teach it” philosophy. And this project is a space I’d like to know, really know. This is what led to deciding to shop around the idea of a book about Kubernetes, written specifically for developers who would otherwise not really interact with it: folks knocking together code in nodeJS or Python that are now getting told “Hey, you get to run this code as well as develop it”: Kubernetes for Developers.

Much of the Kubernetes documentation is written for people who are setting up the cluster or running software with it. Most of the people that I see touching on it are the same folks that use Puppet, Chef, Ansible, or Salt. If I were to put a name to this persona, it might be SRE, operations, or classic system administrators. There is some great work (full disclosure: I’m involved in it) happening within SIG-DOCS to make the documentation a lot more relevant and directed to different personas.

I signed a deal with Packt publishing to author Kubernetes for Developers, aimed to be complete in spring 2018. I suspect the space will be pretty crowded by that point.  I have been keeping track of what’s out there: Kubernetes: Up and Running is clearly leading the pack at NovelRank, and there are four or five other books out there – three others from Packt – but so far none focused on developers.

I’m drawing a lot of the ideas for what content to focus on from StackOverflow questions, questions that occur in the Kubernetes Slack, and bugs reported in the project. If you have some opinions on what would be useful to know about or learn, I’m all ears here as well – please leave me a comment, or reach out to me on Twitter or GitHub.

I see a lot of  potential in making the architecture and operations of running software far more accessible to developers who have mostly been divorced from it. I think Kubernetes could finally provide a developer-centric Data Center API that I have been hunting for.

My first chapters are blocked out, with an outline sketched in place to provide some guide rails. I am still trying to get used to Packt’s publishing tools; I definitely prefer using Scrivener by a large margin.


Accustomed to complexity

I have been watching some YouTube videos from Alphonso Dunn on techniques for pen and ink sketching. I found a couple of his tutorials online when I was traveling in Europe, getting frustrated at myself while trying to represent trees or (even worse) cobblestones. Let me tell you, cobblestones make up a significant part of landscape themes in smaller European towns!

Alphonso’s videos are so effective because he makes the techniques appear not just simple, but achievable. The real value is that he helps you achieve the same results, and does so by breaking down the techniques into understandable, small steps. The dude is a damned good teacher.

Simplicity is hard.

One of my earliest mentors loved to say “inch by inch, life’s a cinch”. An echo of a pattern I learned in engineering to solve complicated problems – break things down, isolate issues and stumbling blocks, simplify the problem. This provides a way to deal with complexity that’s otherwise overwhelming.

It is easy to expect that if breaking things down into smaller components makes things more tractable, that building things using small simple components will get you an easy to understand system. Unfortunately, the inverse doesn’t work that way – instead you get the joy of emergent complexity.

This phenomenon hits everyone – not just programmers. You can see it in the complexity of insurance, laws, and taxes. Lots of small, simple changes that add up to an almost unbelievably complex beast. It also bites most product managers while doing their jobs, and any long running, growing project. Added features and new capabilities interact with all the rest of what you have, sometimes in very unexpected ways. If you’re involved and working with it while it’s growing, this same growth of complexity can be almost invisible. We get so knowledgeable on the specifics and familiar with the details that we become accustomed to the complexity.

The complexity suddenly becomes visible when you step away from the project and come back. It can also become apparent when someone new joins a project, or learns about the product for the first time, usually about the time they’re shouting “What the hell!” These moments are invaluable, as you get a view through different eyes and expectations. This “fresh feedback” is important because we get so used to the weird intricacies and details.

What can you do to work against emergent complexity and keep it simple?

1) work from first principles

The best advice I ever received was to focus on the core, and use that to highlight and refine back down to simplicity. Look at the key problem that you are solving, or trying to solve, or the primary action you are enabling. Call it out, highlight it – and keep it forefront in your mind while you review the variations and branches that exist.

Just as important to what you want to solve, take the time to identify what you don’t want to solve. Sometimes this is obvious, but mostly it’s not.

2) maintain clear boundaries of responsibility

If you’re working on a large project, there are probably multiple components to it. This is super common in larger software efforts. Where you have those boundaries, take the time to describe what happens across them. Take the time to write down and make sure there is common agreement on what a component needs to do, and what it’s responsible for. Often adding what appears to be a simple feature will confuse these responsibilities, and you’ll have corner cases you don’t anticipate. The agreement will evolve, so don’t think that once something is agreed upon it’s static. Make the time to review and validate these responsibilities as you go along, understanding if original needs and assumptions have changed enough to warrant revision – and then update them.

3) search out fresh eyes

Actively work to get the feedback from fresh eyes. Find folks who aren’t familiar with what you’re making to look at what it does, or even how it does it. This is one of the most effective ways to highlight what is confusing, or where complexity resides. This can be formal, but often doesn’t need to be – even informal conversations, or a quick demo and getting reactions can lead to understanding where the complexity resides and what is confusing.

4) prune and grow

Acknowledge that complexity, change, removal and growth are all part of a healthy, living project. Include a periodic retrospective that reviews all the assumptions, and actively trim or remove options to hew to the core of what you want your project to do. Make the time, and intentionally trim as well as grow as you go.

I’m writing this for you, I just don’t know who you are…

I really enjoy traveling, and I’d say “and meeting new people”, but I’m a fairly introverted fellow in person, and tend to hang back a bit. Nonetheless, I watch and I listen. I spent the earlier part of the summer in western Europe – where my knowledge of the languages is not even rudimentary: Dutch, German, French, Danish, Icelandic… fortunately tone, body language, and the immense pervasiveness of English as a trade tongue made it possible.

I am now back in the US, doing a bit of wandering and traveling here. Stomping around in the midwest as I write this post, mostly doing the same – listening, watching, learning. There’s as much to learn next door as there is across the ocean.

I started “this blog” – which has moved a few times, and been re-hosted two or three times as well – almost 16 years ago when I moved from the midwest. Originally a way to “keep in touch” with friends, a sort of public journal. Since then personal writing has morphed to technical writing, the me of now self-censors very differently than the “me of 2000”, and the audience has both grown tremendously and shifted around. RSS news readers have given way to cross posting in social media or just sharing content through search engines. I do post a lot more directly in twitter or facebook, even with their faults and strangeness, than I do here. I have a account, but haven’t yet really engaged in using it to any depth.

I am thinking about the content here, because as I was looking at the “metrics and analytics” I was struck (not for the first time) how much I write for people that I haven’t met. It’s kind of cool – well, “astoundingly cool” really – to write for people I may never even meet.

I expect the majority of people find my content through a search engine, and even a decade ago I was writing with the idea that search engines would index this stuff and make it available. That Google or its competitors would add this to the wealth of knowledge and opinions accessible. But planning for it doesn’t really resonate the same way as seeing the actual reach.

This is a snapshot view of just August of this year – what people are reading and viewing of my writing.


I never would have expected to have people from Germany, India, China and France reading my words. Hopefully getting something useful from it. Other than these analytic trails, it’s hard to see the evidence.

The post that tops even the current views was one I wrote 18 months ago, and it didn’t seem to become popular for a good six months after I wrote it. For quite a number of years, the highest viewed post was one I wrote in 2009 – a walk-through example of setting up CI for a python project with Jenkins (well, Jenkins was known as “hudson” at the time) that I created as a “thank you” for the author of coverage, an open source project.

My posts here are for me, and for you – I just don’t know who you are. Not yet anyway, and maybe I never will – hard to say. I think it’s a worthy endeavor, and there are a lot of topics I’m interested in learning and sharing. If you do happen across this and want to say “hi” – I’d love to hear from you, find me on github, stackoverflow, twitter, facebook, or whatever. A sort of pen-pal across time I guess – anyway, I like hearing the reflections.

Does the game of chess show us a map to future jobs?

I have been interested in the field of Artificial Intelligence since the 80's – pretty much the heart of the "AI winter". Like many others, I remember when Kasparov lost to IBM's Deep Blue chess AI in 1997, and watched in fascination as DeepMind's AlphaGo beat Lee Sedol in Go in 2016.

At this point, the Kasparov defeat is sort of in the history books, as everyone is focused on the AlphaGo wins and the follow-on uses of the technology it employed. Kasparov's interactions are kind of ignored, but they shouldn't be. What has happened since that defeat with the evolution of the game of chess, with human and AI opponents, is fascinating – and I think informative.

In the wake of chess AI getting so strong, a style of chess developed called Freestyle chess, and from that – a "Centaur" model appeared. It's a team of a human and an AI system working together to play the game – and it's immensely powerful. There's an article from 2010 that outlines more of Kasparov's views, and it's worth a read if you're interested. I'm also certainly not the first to see this. There are articles in the Huffington Post about Centaur Chess, the BBC in 2015, and TechCrunch from last year that draw parallels.

The gist of the thesis is that current AI and human cognition aren't strong in the same areas, and each tackles tasks in very different ways. If you combine these two systems in a manner that exploits the strengths of each, you get massive capability and productivity improvements. It is exactly this pattern that I see as the future for quite a number of jobs.

I don't know exactly what form this will take, but I can extrapolate some obvious pieces – two attributes that will show up more and more.

The first is "explainability". Somewhere tucked in between psychology, cognitive sciences, and the art of user experience and design is the place where you get amazing transfer of information between computing systems and humans. If you have an AI system predicting or analyzing something, then sharing that information effectively is critical to having effective coordination. Early examples of this are already prevalent in cars, planes, and military combat vehicles, which have had even more of this kind of investment. Heads-up displays (HUDs) are so relevant that they're the common staple for video games today. And with the variety of first-person shooters, we're even training people en masse to understand the concepts of assistive displays.

But don't get too focused on the visual aspects of explainability – the directional haptics and sound that work with the Apple Watch while navigating are another example – very different from a heads-up display, but just as effective. Conversational interfaces like you're seeing with Alexa, Siri, or Google Home are all steps to broadening interactions, and the whole field of affective computing reaches into the world of understanding, and conveying, information at an emotional level.

Because it's so different from how AI has been portrayed in movies and "public opinion", some folks are trying to highlight that this is completely different by calling it Intelligence Augmentation, although that seems to be a corporate marketing push that hasn't gained too much traction.

The second is collective intelligence. This used to be more of a buzzword in the late 90's when some of the early successes really started appearing. The obvious, now common, forms of this are recommendation engines, but that's really only the edge of where the strengths lie. By collecting and collating multiple opinions, we can basically leverage collective knowledge to help AI systems know where to start looking for solutions. This includes human bias – and depending on the situation, that can be helpful or hurtful to the solution. In a recommendation engine, for example, it's extremely beneficial.

I think there are secondary forms of this concept, although I haven't seen research on it, that could be just as effective: using crowd-sourced knowledge to know when to stop looking down a path, or even a category, of searches. I suspect a lot of that kind of use is happening right now in guiding the measure of effectiveness of current AI systems – categorization, identification, and so forth.

There is a lot of potential in these kinds of systems, and the biggest problem with them is that we simply don't know how best to build them or where to apply them. There's a lot of work still outstanding to even identify where AI assistance could improve our abilities, and even more work in making its assistance easy to use and applicable to a wide diversity of people and jobs. It's also going to require training, learning, and practice to use any of these tools effectively. This isn't downloading "kung fu" into someone's head, but more like giving Archimedes the lever he wanted to move the world.