SceneKit interaction handling – Experiment439

A staple of science fiction movies has been 3D holographic visualizations and controls. Most efforts I’ve seen at taking real visualization data and putting it into a 3D context haven’t been terribly successful. At the same time, the advance of AR and VR makes me suspect that we should be able to take advantage of the additional dimension in displaying and visualizing data.

I started a project, named Experiment439, to go through the process of creating and building a few visualizations, seeing what I can do with them, and working out what might be refined into a library that can be re-used.

I wanted to take a shot at this leveraging Apple’s SceneKit 3D abstraction and see how far I could get.

The SceneKit abstraction and organization for scenes is a nice setup, although it’s weak in one area – delegating interaction controls.

The pattern I’m most familiar with is the View Controller setup (and its many variants depending on how you display data). Within SceneKit, an SCNNode can encapsulate other nodes (and controls overall placement in the view), so it makes a fairly close analogue to the embedding of views within each other that I’m familiar with from iOS and macOS development. Coming up with something that encapsulates and controls an SCNNode (or set of SCNNodes) seems like a pretty doable (and useful) abstraction.

The part that gets complicated quickly is handling interaction. User-invoked events in SceneKit today are limited to projecting hit-tests from the perspective of the camera that’s rendering the scene. In the case of AR apps on iOS, for example, the camera can be navigating the 3D space, but when you want to select, move, or otherwise interact, you’re fairly constrained to mapping touch events projected through the camera.

I’ve seen a few iOS AR apps that use the camera’s positioning as a “control input” – painting or placing objects where the iOS camera is positioned as you move about an AR environment.

You can still navigate a 3D space and scene, and see projected data – both 2D and 3D – very effectively, but coming up with equivalents to the control interactions you get in Mac and iOS apps has been significantly trickier.

A single button that gets toggled on/off isn’t too bad, but as soon as you step into the world of trying to move a 3D object through the perspective of the camera – shifting a slider or indicating a range – it gets hellishly complex.

With Apple’s WWDC 2019 around the corner (tomorrow as I publish this) and the rumors of significant updates to AR libraries and technologies, I’m hoping that there may be something to advance this space and make this experiment a bit easier, and even more to expand on the capabilities of interacting with the displayed environment.

iOS AR apps today are a glorified window into a 3D space – amazing and wonderful, but heavily constrained. They let me navigate around visualization spaces more naturally than anything pinned to a desktop monitor, but at the cost of physically holding the device that I would also use to interact with the environment. I can’t help but feel a bit of jealousy for the VR controllers that track in space, most recently after the glowing reviews of the Valve Index VR controllers.

Better interaction capabilities of some kind will be key to taking AR beyond nifty-to-see but-not-entirely-useful games and windows on data. I’m hoping to see hints of what might be available or coming with the Apple ecosystem in the next few days.

Meanwhile, there is still a tremendous amount to be done to make visualizations and display them usefully in 3D. A lot of the inspiration for the current structure of my experiment has come from Mike Bostock’s amazing D3.js library, which has been so successful in helping to create effective data visualization and exploration tools.

iOS Dev Diary – using UIDocument

I have been working on an artist utility app, with the primary purpose of presenting an image with a super-thin grid overlay. The inspiration came from the cropping functionality in the Photos app – but that grid is ephemeral, tied to the act of cropping an image, and isn’t easily viewable on a continued basis (such as on an iPad) when you want it to support your sketching or painting. A grid like this serves a couple of purposes. One is the “process by Leonardo” for helping to capture and copy an image by hand. The other is to double-check the framing and composition against what’s called the Rule of Thirds.

I originally didn’t think of this as an application that would have or use a document format, but after trying it out a bit and getting some feedback on the current usage, it became abundantly clear that it would benefit tremendously from being able to save the image and the framing settings that show the grid overlay. So naturally, I started digging into how to really enable this, which headed directly towards UIDocument.

Using UIDocument pretty quickly raised the question of supporting a viewer for the files, which led to researching UIDocumentBrowser, which was a rather surprisingly invasive design change. Not bad, mind you – just a lot of moving parts and new concepts:

  • UIDocument instances are asynchronous – loading and saving the contents is separate from instantiating the document.
  • UIDocuments support cloud-hosted services from the get-go – which means they also include a concept of states that might be surprising, including inConflict and editingDisabled, in addition to reflecting loading, saving, and error conditions while doing these asynchronous actions.
  • UIDocument is built to be subclassed, but how you handle tracking the state changes & async is up to you.
  • UIDocumentBrowser is built to be controlled through a delegate/controller setup and UIDocumentBrowserViewController, which is subclassed and also demands to be the root of the view hierarchy.

Since my document data included UIImage and UIColor, both of which are annoying to persist using Swift’s struct-based coding, I ended up using NSKeyedArchiver, and then later NSSecureCoding, to save out the document.

One of the first lessons I barked my shin on here was when I went to make a ThumbnailPreview extension that loaded the document format and returned a thumbnail for the document icon. The first thing I hit was that NSKeyedUnarchiver was failing to load/decode the contents of my document when attempting to make the thumbnail, while the application was able to load and save the document just fine. It likely should have been more obvious to me, but the issue has to do with how keyed archiving works – it decodes by class name, and in Swift that name includes the module. In the plugin, the module name was different – so it was unable to load the class in question, which I found out when I went to the trouble of adding a delegate to the NSKeyedUnarchiver to see what on earth it was doing.

One solution might have been to add some translation on NSKeyedUnarchiver, using setClass(_:forClassName:) to map the archived class name to the class as it’s named in the plugin’s module. I took the different path of taking the code that represented my document model and breaking it out into its own framework, embedded within the application – and then imported that framework into the main application and the preview plugin as well.

UIDocument Lesson #1: it may be worth putting your model code into a framework so plugins and app extensions can use it.

The second big “huh, I didn’t think of that…” was in using UIDocument. Creating a UIDocument and loading its data are two very separate actions, and a UIDocument actually has quite a bit of state that it might be sharing. The DocumentBrowser sample code took the path of making an explicit delegate structure to call back as things loaded, which I ended up adopting. The other sample code that Apple provided (Particles) was a lot easier to start with and understand, but doesn’t really do anything with the more complex world of handling saving and loading, and the asynchronous calls to set all that up.

UIDocument Lesson #2: using a document includes async calls to save and load, and states that represent even potential conflicts when the same doc is edited at the same time from different systems.

One particularly nice little feature of UIDocument is that it includes a Progress property that can be handed and set on the UIDocumentBrowser’s transition controller when you’ve selected a document, so you get a nice bit of animation as the document is loaded (either locally or from iCloud).

UIDocumentBrowser Lesson #1: the browser subclass has a convenient (but not obvious) means of getting an animated transition controller for use when opening a document – and you can apply a UIDocument’s Progress to show the loading.

The callbacks and completions were the trickiest to navigate, trying to isolate which view controller had responsibility for loading the document. I ended up making some of my own callbacks/completion handlers so that when I was setting up the “editor” view I could load the UIDocument and handle the success/failure, but also supplied the success/failure from that back to the UIDocumentBrowserViewController subclass I created to support the UIDocumentBrowser. I’m not entirely convinced I’ve done it the optimal way, but it seems to be working – including when I need to open the resulting document to create a Quicklook Thumbnail preview.

The next step will be adding an iOS Action Extension, as that seems to be the only real way that you can interact with this code directly from Photos, which I really wanted to enable based on feedback. That will dovetail with also allowing the application to open image-based file URLs and create a document using that image file as its basis. The current workflow for this application is creating a new document, and then choosing an image (from your photo library), so I think it could be significantly simpler to invoke and use.

iOS Dev Diary – accessibility quirk with “Bold Text”

I just worked around a surprisingly tricky bug. The “Bold Text” accessibility feature in iOS has some really surprising impacts – it changes the rendering of images, specifically toolbar images.

In the iOS app I’m working on, I have a toolbar icon that I made in vector format (PDF). In the back and forth of working on the app, I set the global tint of the storyboard to white. The toolbar button was set to only use the vector image, and its tint is set to black. When I ran the application, the toolbar button showed up just fine, and in the tint (black) I selected for the button.

When one of my testers ran the app, the toolbar button “disappeared”. It was still there, but rendering white on the white toolbar. It took a while to figure out the difference between our environments: Bold Text was enabled in accessibility. Then it took a while longer to find that, when enabled, it wasn’t respecting the local tint but using the global tint.

That “Bold Text” being enabled affected the image rendering came as a surprise to me. Some friends indicated they’d seen significant performance issues with Bold Text as well (in cells in a tableview), so they knew that it impacted image rendering – I guess it does something to try and make an image “bolder”, even though it’s not text. (I’m unable to perceive the visual difference in rendering the vector image.)

It turns out that Accessibility Inspector also doesn’t reflect this setting. To try it out in the simulator, you need to go to the settings in the simulator directly and enable it. Fortunately, it does reflect the changes once enabled. (radr://49752183) UPDATE: marked as a duplicate of radr://49301632

Once I found that it was using the global tint, it was easy to set that global tint to something more sane (black in my case), so the workaround was very easy once I found it. Fortunately, the sample code for the bug report was equally easy. (radr://49752053)

In the end, I came away with a new “launch ready” checklist item:

  • review all the tints in the storyboard and make sure they’re consistent.

Back to NetNewsWire

I started with RSS and NetNewsWire as an aggregator quite a while ago to keep up with the blogs and other various information sources I wanted to follow. It was the most effective way of keeping up with the developer communities I was interested in. Things progress, change, and generally move – and I moved with RSS to using Google’s Reader – which was really a lovely solution, in that I had a sync’d view of what I’d read regardless of the device I was using. Then in 2013, they shut it down.

I was disappointed, but not angry. I was getting a lot of connected news stories from Twitter, LinkedIn, some email newsletters, and even a touch through friends on Facebook. Fast-forward to 2019 and the state of social media has devolved so much that I can’t reliably find recent updates – the timelines aren’t timelines, instead having morphed into tuned and algorithmically calculated ad-feeders. I suppose it was inevitable – trusting those sources to find and gather information, it’s a natural place to monetize with advertising, so of course the providers will optimize that.

A month ago I started a “purge these assholes” pass on my social media feeds, which was mostly successful. After I stopped following a number of hyperbolic-tending sources, the streams were better. They still didn’t help me learn and find new information – they still weren’t what I wanted and once had.

I was at the Xcoders meetup a month ago, and getting back into doing some iOS and Mac development projects. I knew that Brent had been quietly working on Evergreen, which was recently renamed to NetNewsWire – now open source and with a working build. It is a development build – so I fully expect things might break, not work, or otherwise have holes, but it was a no-brainer for me. Now it’s installed, in my dock, and getting daily use.

I’m relieved to have a news source that

  • is only filtering what I want, when I want
  • supports the open web
  • isn’t brutally promoting ads into my face.

I’m happy to sort and filter through all the various sources. In fact, I even went through all the blogs listed in IOSDevDirectory and made an iOS Dev OPML file for myself. If you’re so inclined, feel free to grab it and use it yourself.

iOS 12 DevNote: Embedded Swift Frameworks and bitcode

A side project for the baristas at my favorite haunt has been a fun “getting back into it” programming exercise for iOS 12. It’s a silly simple app that checks the status of the network and whether the local Wi-Fi router is accessible, and provides some basic diagnostics and suggestions for the gang behind the counter.

It really boils down to two options:

    Yep, probably a good idea to restart that Wi-Fi router
    Nope, you’re screwed – the internet problem is upstream and there’s nothing much you can do but wait (or call the Internet service provider)

It was a good excuse to try out the new Network.framework and specifically NWPathMonitor. In addition to the overall availability, I wanted to report on whether a few specific sites that the shop often uses were responding, and on top of that I wanted to do some poking at the local Wi-Fi router itself to make sure we could “get to it”, and then make recommendations from there.

As I dug into things, I ended up deciding to use a Swift framework, BlueSocket, with the idea that if I could open a socket to the Wi-Fi router, then I could reasonably assume it was accessible. I could have used Carthage or CocoaPods, but I wanted to specifically try using git submodules for the dependencies, just to see how it could work.

With Xcode 10, the general mechanism of dragging in a sub-project and binding it in works extremely easily and well, and the issues I had really didn’t hit until I tried to get something up to the iOS App Store for TestFlight.

The first thing I encountered was that the sub-projects had a variable for CFBundleVersion, $(CURRENT_PROJECT_VERSION), that apparently wasn’t getting interpolated and set when they were built as sub-projects. I ended up making a fork of the project and hard-coding the Info.plist with the specific version. Not ideal, but something that’s at least tractable. I’m really hoping that this coming WWDC shows some specific Xcode/iOS integration improvements when it comes to Swift Package Manager. Sometimes the Xcode build stuff can be very “black box”, and it would be really nice to have a clearer integration point for external dependencies.

The second issue was a real stumper – even though everything was validating locally for a locally built archive, the app store was denying it. The message that was coming back:

Invalid Bundle – One or more dynamic libraries that are referenced by your app are not present in the dylib search path.

Invalid Bundle – The app uses Swift, but one of the binaries could not link to it because it wasn’t found. Check that the app bundles correctly embed Swift standard libraries using the “Always Embed Swift Standard Libraries” build setting, and that each binary which uses Swift has correct search paths to the embedded Swift standard libraries using the “Runpath Search Paths” build setting.

I dug through all the linkages with otool, and everything was looking fine – and finally a Google trawl turned up a question on StackOverflow. Near the bottom there was a suggestion to disable bitcode (which is on by default when you upload an iOS archive). I gave that a shot, and it all flowed through brilliantly.

I can only guess that when you’re doing something with compiled-from-Swift dylibs, the bitcode process does something that the app store really doesn’t like. It’s probably no problem without the frameworks (all the code in the project directly), but with the frameworks in my project, bitcode needed to be turned off.

Made it through all that, and now it’s out being tested with TestFlight!

El Diablo Network Advisor

Vapor 3 and a few random experiments

This past week I dug more deeply into server-side Swift, specifically with Vapor 3. Vapor was interesting because it was recently rebuilt on top of SwiftNIO, and initial reports of its performance were very positive. A highly performant HTTP application framework in a memory-safe language? Worth a look!

I have used dynamically typed languages (NodeJS/TypeScript/Javascript and Python) for quite a while, so the biggest shock is transferring back towards the constraints of a strictly typed language. This cascades into how the software is represented at a lot of levels, and really the transfer to “classes, structs, and enums” was the hardest to re-acquaint myself with. The piece that feels the weakest (compared to other languages and frameworks) is the testing – the dynamic languages use the full capabilities of the languages’ dynamism, and it’s brutally missing from Swift. I have become immensely spoiled using testing frameworks like Jasmine, or Mocha and Chai with supertest over express, which make for a super-easy-to-read testing framework that works the code directly.

Speaking of BDD, I took a day’s detour into trying to use Quick and Nimble, but in the end decided it was more pain than value – and leveraging XCTest, even if writing tests with it felt stunted and somewhat awkward, was a more robust path. Testing was particularly painful with server-side Swift; it seems far more robust with iOS projects, but the lack of reflection, and of XCTest identifying what to run on Linux, is atrocious. When I found the SwiftPM command to help collect the tests with XCTest:

swift test --generate-linuxmain

that won the day, and it was back to XCTest.

Vapor 3 itself was very straightforward, although the docs are very rough – and in some cases downright useless. There are multiple points where it isn’t clear from the templates and sample code how you attach and use extensions to Vapor (WebSocket, Auth, and so forth). Fortunately the community (on Discord rather than Slack, and Vapor on StackOverflow) makes up the difference. The developers who are actively pushing Vapor forward as well as community members are very accessible and willing to answer questions.

As I mentioned earlier, I’m finding the idioms of programming with swift the hardest to get my head around. It is a very different way of thinking about the problem and how to solve it, tending to be fully specified at every level. Structs, extensions, and enums make up most of the structures, often with lots of smaller files in the examples that I’ve been seeing so far.

While it’s very straightforward to read and understand, I find myself struggling to know where to look things up, and how to read documentation to get what I need out of it. In addition, even Apple’s documentation seems significantly weaker than it has in years past. There’s a new style to the documentation that I’m struggling to learn – reading the docs to know what enumeration options should be used, how and when, is definitely a challenge. It is often Q&A and samples in StackOverflow that provide the closing hints on how to use the code in any holistic way that makes a difference.

On the good side, Xcode running Vapor on my laptop was a gem, and I was immediately enthralled with the cpu and memory tracking details that you could see while running the code locally. I haven’t fully explored what you can do, but even just seeing the live CPU and memory tracking on the Vapor application while it’s running is wonderful. In other environments, there would be a lot of infrastructure setup to capture that same level of immediate detail – and it’s just built in with Xcode.

[Figure: CPU spikes when running “ab” load testing]
[Figure: memory usage over time with the same “ab” load testing]

Vapor makes it easy to leverage Xcode, wrapping the SwiftPM tool commands so you can invoke something akin to:

vapor xcode -y

This will regenerate an Xcode project file and open it. Vapor projects also have a number of examples of wrapping the code into a container to run however you like, and the next version of Vapor (4, in development now) will have some “polite shutdown” signal handlers for SIGINT and SIGTERM, which will make it work better with orchestration systems like Kubernetes.

I have this perverse idea of wanting to run this same code that I can put into a container on an iOS device for a quick-shot “mobile server”. Yes – I know there are issues with iOS and activating the relevant devices through SwiftNIO, but the idea of having my own portable server as an iOS app is really appealing.

Vapor 3 is all based on Swift Package Manager, for which there’s no direct Xcode support (yet). It looks like it may be possible to use Xcode’s cross-project linking to have an SPM-based Xcode project working with a more classic iOS one using the project as a dependency. There’s an article on how this can work called Bringing Swift NIO to the iPhone, and a similar reference, a walk-through how-to in the Swift forums. I haven’t wrapped my sample Vapor 3 project into an iOS application yet, but I’ll be giving that a shot shortly.

Adding tracing with Jaeger to an express application

I have been following distributed tracing technologies – Zipkin, OpenTracing, Jaeger, and others – for several years, without seriously trialing any of them. Just prior to the holidays, we were having a number of those “why is this slow?” questions about an express application, written in TypeScript, providing an API endpoint. The API fronts multiple different data sources – MongoDB, InfluxDB, and Redis – and we run it all in containers, deployed and orchestrated with Kubernetes.

I decided that with some quieter time with the holiday, this was the perfect time to experiment with and take a stab at seriously implementing some form of tracing and external presentation to shed some light on what was happening within the API endpoint, so we could have some good conversations about the scope and structure of what’s being returned in the various implemented API endpoints.

I wrote about Jaeger (and OpenTracing) in Kubernetes for Developers, which provided me with some passing familiarity. Given that prior research, I chose to go that route again to see what we could learn. A new complication to the scenario was that we wrote our API server using TypeScript. While the concepts are well documented, there aren’t any detailed tutorials or examples showing how to enable tracing in your TypeScript/express application. My goal with this article is to provide one of those detailed examples that may help to provide a roadmap (or highlight potholes) for others who are interested in doing the same.

The resources that I used to source and start the learning included the book I wrote (Kubernetes for Developers), and also:

  • – more marketing than detail, but provided a starting point for finding other resources and hinting at where to look
  • – the source repository for the NodeJS jaeger client implementation, primarily the README (outside of when I needed to look at how something was implemented) for developer level notes about the implementation details.
  • – the real source for a lot of the starting points, with concrete implementation examples, even if they didn’t always follow the same patterns I needed. Yuri Shkuro did some lovely work with providing these examples. I didn’t hassle him in learning or using any of this, so any misinterpretation of what was expected is mine.

Getting a toe hold and seeing a trace

The first step in my journey was to get tracing enabled and see any single result. There are a lot more examples for Go than for Express/Node.js. Adding in TypeScript and getting the libraries to work seamlessly within the type-constrained interfaces wasn’t entirely obvious.

The client itself installs cleanly with npm or yarn, as does opentracing, so getting the libraries was as simple as:

npm install opentracing jaeger-client --save

The jaeger client for node was written using Flow, which is a different approach to adding type constraints on top of Javascript, and doesn’t translate directly to TypeScript. Fortunately, it implements OpenTracing, and OpenTracing offers TypeScript type definitions, so it was possible to load those and use OpenTracing’s definitions with the jaeger client’s implementation.

In the end, I got this operational in TypeScript with:

import * as opentracing from 'opentracing';

From there, you can set up the overall tracing client to work with your code. The tutorials inspired some of this, but the gist of the setup is to define an initialization function that passes in the tracer configuration, including how traces are sampled and reported. For getting a toe-hold and just starting, I used:

const initJaegerTracer = require('jaeger-client').initTracer;

function initTracer(serviceName: string) {
  const config = {
    serviceName: serviceName,
    sampler: {
      type: 'const',
      param: 1,
    },
    reporter: {
      logSpans: false, // this logs whenever we send a span
    },
  };
  const options = {
    logger: {
      info: function logInfo(msg: string) {
        console.log('INFO ', msg);
      },
      error: function logError(msg: string) {
        console.log('ERROR ', msg);
      },
    },
  };
  return initJaegerTracer(config, options);
}

export const tracer = initTracer('my-service-api') as opentracing.Tracer;

I chose to use the ‘const’ sampler (implying “trace everything”), and while I had logSpans set to true to see how it was working, I switched it to false as soon as I saw it was operational, as it was generating a huge amount of additional console logging output that was cluttering up my output.
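Tracing everything is fine on a development machine, but it may be more than you want under real load. As a minimal sketch of one alternative, the sampler section of the config could be chosen per environment. The helper name samplerFor, the environment strings, and the 0.1 rate are all hypothetical; 'probabilistic' is another of the jaeger-client sampler types, where param is the fraction of traces to keep.

```typescript
// Shape of the `sampler` section of the jaeger-client config used above.
interface SamplerConfig {
  type: 'const' | 'probabilistic';
  param: number;
}

// Hypothetical helper: trace everything outside production, and only a
// fraction of requests in production. The 0.1 default is an arbitrary example.
function samplerFor(environment: string, productionRate: number = 0.1): SamplerConfig {
  if (environment === 'production') {
    // 'probabilistic' keeps roughly `param` (0..1) of all traces
    return { type: 'probabilistic', param: productionRate };
  }
  // 'const' with param 1 keeps every trace
  return { type: 'const', param: 1 };
}

const devSampler = samplerFor('development');
const prodSampler = samplerFor('production');
console.log(devSampler, prodSampler);
```

The result would be dropped into the config object in place of the hard-coded sampler block.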

logger in this function looks like it was built so that you could pass in any of a variety of logging functions (an example of which might be Winston). For getting started, I just stuck with some simple console.log statements.
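If you did want to route those messages through an application logger, a small adapter is probably all it takes. This is a sketch under assumptions, not something I have running: tracerLoggerFrom, AppLogger, and the prefix are hypothetical names, and the only thing taken from the setup above is that the logger option is an object with info and error functions.

```typescript
// The two methods the tracer's `logger` option is given in the setup above.
interface TracerLogger {
  info(msg: string): void;
  error(msg: string): void;
}

// Any application logger with matching methods (a Winston logger has both).
interface AppLogger {
  info(message: string): unknown;
  error(message: string): unknown;
}

// Hypothetical adapter: forward tracer messages to the app logger with a
// prefix, so they can be told apart from the application's own log lines.
function tracerLoggerFrom(appLogger: AppLogger, prefix: string = 'jaeger'): TracerLogger {
  return {
    info: (msg: string): void => { appLogger.info(`[${prefix}] ${msg}`); },
    error: (msg: string): void => { appLogger.error(`[${prefix}] ${msg}`); },
  };
}

// Stand-in logger for the example; a real application logger could be passed instead.
const consoleLogger: AppLogger = {
  info: (message) => console.log('INFO ', message),
  error: (message) => console.log('ERROR ', message),
};
const tracerOptions = { logger: tracerLoggerFrom(consoleLogger) };
tracerOptions.logger.info('tracer initialized');
```

The tracerOptions object here plays the role of the options object passed to initJaegerTracer.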

Once I had the tracer set up, I could use it to create spans, annotate them, finish them, and let the jaeger-client library take care of getting them outside of my code and to their external destination. This also implies there’s somewhere to capture and display this information, and the Jaeger getting started documentation came in useful here.

Jaeger offers an all-in-one image, with an in-memory store and UI for collecting and visualizing traces, that’s perfect for running in a docker container right on your development machine.

I used the example directly, exposing all the ports (even though I only use a few) and it fired right up:

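# the image name on the last line is assumed to be Jaeger's all-in-one image
docker run -d --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one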

Something to note (maybe obvious, but I’ll mention it): to kill off this background container, you use the command:

docker kill jaeger

and before you try and run it again, you may want to “clean it up” and remove it. You can do that with the command:

docker rm jaeger

Once it’s up and running, you can access port 16686 on localhost (open http://localhost:16686) to see the jaeger UI (it is boring until you get traces into it).

With an endpoint ready to receive, you can start to create spans and send them in. Yuri Shkuro’s tutorial code (lesson 2) really cemented this piece for me, and it was relatively light work to translate it for use with TypeScript.

I started off by picking one express route that we’d already implemented and enabling it with a trace. Setting up a trace was pretty straightforward, and the pattern quickly emerged for how I’d repeat this, so it moved into a function to simplify the code in the route.

Creating a span is primarily calling tracer.startSpan. When you’re doing that, you can create the current span as stand-alone, or you can build it up with references to other spans. This can either be another span reference directly, or you can extract the span context information from somewhere else and use it. For express routes, there are some handy helper utilities to look for and pull a tracing context out of the HTTP request headers, so the logic I resolved upon here was to try and pull that context, and if that does not exist, then go ahead and create a stand-alone span.

I’m still experimenting with what and how much to annotate onto spans, and so far I’ve chosen to use the function names from the routes as the span names, and to annotate the controller for a bit of grouping.

export function createControllerSpan(controller: string, operation: string, headers: any) {
  let traceSpan: opentracing.Span;
  // NOTE: OpenTracing type definitions at
  // <>
  const parentSpanContext = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, headers);
  if (parentSpanContext) {
    // continue the trace that the caller passed along in the request headers
    traceSpan = tracer.startSpan(operation, {
      childOf: parentSpanContext,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  } else {
    // no incoming context, so start a stand-alone span
    traceSpan = tracer.startSpan(operation, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  }
  return traceSpan;
}

With the span created, the other critical piece is to report back when the span should be marked as completed. The function finish() on a span can be invoked directly, but it quickly became apparent that it was handy to also annotate the return result code and if it should be considered an error. This ends up being more handy in the UI than anything else, letting you filter down to “interesting” traces.

I ended up also encapsulating that into a function:

export function finishSpanWithResult(span: opentracing.Span, status: number, errorTag?: boolean) {
  span.setTag(opentracing.Tags.HTTP_STATUS_CODE, status);
  if (errorTag) {
    span.setTag(opentracing.Tags.ERROR, true);
  }
  span.finish();
}

I’m not particularly thrilled with the errorTag?: boolean optional on the tail end of this function; I’m sure there are more aesthetic ways to handle it. I’m still prone to think of function signatures with the named arguments that are common in other languages (Python), so take my specific implementation with a grain of salt.

Between these two functions, I have the pieces to create and finish spans, which results in code within an express route that looks something akin to:

import { createControllerSpan, finishSpanWithResult } from '../helpers/tracing';
import * as opentracing from 'opentracing';

// ...

// within the express route function:
const traceSpan = createControllerSpan('myController', 'doSomethingCool', args.headers);
// if something is not found, finish the span and return early:
//     finishSpanWithResult(traceSpan, 404);
//     return res.status(404).send();
try {
    // do work here
    // resulting in a resultList object that we'll return:
    finishSpanWithResult(traceSpan, 200);
    return res.status(200).json(resultList);
} catch (error) {
    finishSpanWithResult(traceSpan, 500, true);
    console.log('error while listing things ', error);
    return res.status(500).send();
}

With this in place, and the code’s previously existing tests (which leverage supertest to work the API endpoint), just running the unit tests was generating traces, which turned out to be immensely useful.

I’m fairly confident that the logic that I put into my routes could have been encapsulated within an express middleware, and we may take that path in the future. We are also using Swagger 2.0 as a definition layer, and the swagger-express client tooling sets up each of the operationId as middleware functions, so it was not as convenient to try that with our particular setup.
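For what it’s worth, here is a rough sketch of what that middleware could look like. It is not something we run: the request, response, and span types are minimal hypothetical stand-ins (RequestLike, ResponseLike, SpanLike) so the example is self-contained, startSpan is an injected function that could be backed by something like createControllerSpan, and naming the span after the HTTP method and path (rather than the route function name I used above) is an assumption. It relies on the response emitting a 'finish' event once it has been sent, which Node’s HTTP response does.

```typescript
// Minimal stand-ins for the pieces of express and opentracing used here.
interface SpanLike {
    setTag(key: string, value: unknown): void;
    finish(): void;
}
interface RequestLike {
    method: string;
    path: string;
    headers: Record<string, string>;
    span?: SpanLike;
}
interface ResponseLike {
    statusCode: number;
    on(event: 'finish', listener: () => void): void;
}
type StartSpan = (operation: string, headers: Record<string, string>) => SpanLike;

// Build a middleware that opens a span per request and closes it when the
// response has been sent, tagging the status code along the way.
function tracingMiddleware(startSpan: StartSpan) {
    return (req: RequestLike, res: ResponseLike, next: () => void): void => {
        const span = startSpan(`${req.method} ${req.path}`, req.headers);
        req.span = span; // lets route handlers create child spans
        res.on('finish', () => {
            span.setTag('http.status_code', res.statusCode);
            if (res.statusCode >= 500) {
                span.setTag('error', true);
            }
            span.finish();
        });
        next();
    };
}

// --- illustration with fake request/response objects ---
class RecordedSpan implements SpanLike {
    name: string;
    tags: Record<string, unknown> = {};
    finished = false;
    constructor(name: string) {
        this.name = name;
    }
    setTag(key: string, value: unknown): void {
        this.tags[key] = value;
    }
    finish(): void {
        this.finished = true;
    }
}

class FakeResponse implements ResponseLike {
    statusCode = 200;
    private listeners: Array<() => void> = [];
    on(_event: 'finish', listener: () => void): void {
        this.listeners.push(listener);
    }
    end(): void {
        this.listeners.forEach((listener) => listener());
    }
}

const recordedSpans: RecordedSpan[] = [];
const middleware = tracingMiddleware((operation) => {
    const span = new RecordedSpan(operation);
    recordedSpans.push(span);
    return span;
});

const request: RequestLike = { method: 'GET', path: '/things', headers: {} };
const response = new FakeResponse();
const handlerCalls: string[] = [];
middleware(request, response, () => {
    // stands in for a route handler that fails and sends a 500
    handlerCalls.push('handler');
    response.statusCode = 500;
    response.end();
});
```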

Adding this code to every route was frankly pretty tedious, but I used the time to normalize each of the routes to a more consistent pattern. It was a bit time consuming, but it wasn’t hard, and the end result was structurally more consistent routes across all the controllers.

This level of implementation gets you a view of “how long” an API call took, but not the detail within it of why it took so long. A necessary first step, it was valuable in itself but the overhead of Jaeger and OpenTracing to get just this information was akin to swatting a fly with a flamethrower. The real value comes from extending the tracing so you can see what within the API call took time, and how long it took.

Datagram size problems

Just this level of tracing was enough to highlight a quirk that’s worth calling out. While implementing and running the tests to make sure I didn’t break anything, I started seeing a repeated error message in the unit test console output, starting with EMSGSIZE. After some digging, it turned out that this error message was being caused by MacOS’s default datagram size being MUCH smaller than the jaegertracing libraries expected.

It boils down to how JaegerClient chose to transmit data while keeping the process of collecting and transmitting this additional data as light as possible. The client collects the traces and then manages the transmission of them in the background using the UDP network protocol. This is basically a “fire and forget” data packet called a datagram, and the maximum size of a datagram is set by your operating system. The default on Linux (and most Unix OS) is 65,536 bytes, but the built-in default that MacOS uses is 9216 bytes. The error that was appearing was JaegerClient reporting that it was unable to compose and send the traces because it always expected to use the larger datagram sizes.

The solution was fairly straightforward – tell the local OS to allow those datagrams to be quite a bit larger (match what Linux was doing). On MacOS you can view the size it’s currently set to with the command:

sysctl net.inet.udp.maxdgram

and you can set it with:

sudo sysctl net.inet.udp.maxdgram=65536

Once I did that, the error messages disappeared, and I saw significantly more traces hitting Jaeger that I could dig through.

But what is it doing?

At this point, we had a nice visualization of the API call timing, but I wanted to get the real wins of tracing by adding sub-spans to break up the work that was happening. Our API code works with multiple external data resources, so those calls were the natural place I wanted to wrap with tracing.

We are using Mongoose and its schemas, as well as the Influx node client libraries, to access these remote resources. I’m sure with more work I could have made the wrapping less intrusive, but with a few additional helper methods I was able to quickly get 90% of the value and a view into “what’s taking how much time”.

First I made a method to initiate a span within a mongoose Schema method (we use the virtual methods within Mongoose to consolidate our data models in the code):

export function createSchemaSpan(schemaName: string, operation: string, parentSpan?: opentracing.Span) {
    if (parentSpan) {
        // attach this work to the span handed down from the caller
        return tracer.startSpan(operation, {
            childOf: parentSpan,
            tags: {
                [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
                [opentracing.Tags.COMPONENT]: schemaName
            }
        });
    } else {
        return tracer.startSpan(operation, {
            tags: {
                [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
                [opentracing.Tags.COMPONENT]: schemaName
            }
        });
    }
}

The logic follows the same pattern as the express route span creation, but instead of taking headers apart and looking for a span context, it’s set up to accept a parent span, which is much more readily available at the level of the mongoose virtual method function call. I then appended an optional parentTraceSpan parameter to each of those function calls (which are typically used from the express routes), and used that with this function to set up the span for any work that happened, and continue the cascade down.
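The two branches in these helpers differ only in whether childOf is set, so the duplication could be folded into a small options builder. This is a sketch under assumptions rather than what we shipped: serverSpanOptions and the SpanLike and SpanOptionsLike types are hypothetical names, and the tag keys are written out as the string values that opentracing.Tags.SPAN_KIND, SPAN_KIND_RPC_SERVER, and COMPONENT stand for.

```typescript
// Hypothetical stand-ins so the example is self-contained.
interface SpanLike {
    finish(): void;
}
interface SpanOptionsLike {
    childOf?: SpanLike;
    tags: Record<string, string>;
}

// Build the options for a server-side span once, adding the parent
// reference only when a parent span was actually handed in.
function serverSpanOptions(component: string, parentSpan?: SpanLike): SpanOptionsLike {
    const options: SpanOptionsLike = {
        tags: { 'span.kind': 'server', component: component },
    };
    if (parentSpan) {
        options.childOf = parentSpan;
    }
    return options;
}

const parent: SpanLike = { finish: () => {} };
const childOptions = serverSpanOptions('Site', parent);
const rootOptions = serverSpanOptions('Site');
```

With something like this in place, the body of a helper such as createSchemaSpan shrinks to a single tracer.startSpan(operation, serverSpanOptions(schemaName, parentSpan)) call.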

The other place that was getting some MongoDB time was doing document queries. This got a little awkward with TypeScript, mostly because I did not feel super confident in how generics work and parameterizing functions like this, so I took a more ugly shortcut and did some down-casting of the types, and re-cast them on result, so that I could write a wrapper that consistently traced the call.

The end result is this function:

export async function traceMongoQuery(parentSpan: opentracing.Span, traceName: string, documentQuery: DocumentQuery<any, any>) {
    const traceSpan = tracer.startSpan(traceName, {
        childOf: parentSpan,
        tags: {
            [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
            [opentracing.Tags.COMPONENT]: 'mongodb'
        }
    });
    const documentResult = await documentQuery;
    // mark the span complete once the query has resolved
    traceSpan.finish();
    return documentResult;
}

which ends up getting used in a pattern like:

const possibleSite = await traceMongoQuery(traceSpan, 'Site.findOne', Site.findOne({ name: siteId })) as ISiteModel;

It means that when I trace the query, I need to know what should be coming back out so I can cast it back into the right object model, which is a little stupid, but kept everything working. Certainly not my proudest moment, but it’s functional.
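In hindsight, a generic type parameter would probably remove the need for the cast. The following is a minimal sketch of that idea, not what is in our code base: TracerLike, SpanLike, and SiteLike are hypothetical stand-ins, the tracer is passed in to keep the example self-contained, and it assumes the query object is a thenable that TypeScript will accept as PromiseLike<T> (a plain Promise is used below to stand in for the mongoose query).

```typescript
// Hypothetical stand-ins for the opentracing types used here.
interface SpanLike {
    finish(): void;
}
interface StartOptions {
    childOf?: SpanLike;
    tags?: Record<string, unknown>;
}
interface TracerLike {
    startSpan(name: string, options?: StartOptions): SpanLike;
}

// T is inferred from whatever the query resolves to, so the caller
// gets a correctly typed result back without casting.
async function traceQuery<T>(
    tracer: TracerLike,
    parentSpan: SpanLike,
    traceName: string,
    query: PromiseLike<T>
): Promise<T> {
    const span = tracer.startSpan(traceName, {
        childOf: parentSpan,
        tags: { 'span.kind': 'server', component: 'mongodb' },
    });
    try {
        return await query;
    } finally {
        // finish the span whether the query resolved or rejected
        span.finish();
    }
}

// --- illustration with a fake tracer and a fake query ---
interface SiteLike {
    name: string;
}

const finishedSpans: string[] = [];
const demoTracer: TracerLike = {
    startSpan(name: string): SpanLike {
        return { finish: () => { finishedSpans.push(name); } };
    },
};

const rootSpan = demoTracer.startSpan('listSites');
const fakeQuery: Promise<SiteLike> = Promise.resolve({ name: 'alpha' });
// possibleSite is inferred as Promise<SiteLike>, with no cast at the call site
const possibleSite = traceQuery(demoTracer, rootSpan, 'Site.findOne', fakeQuery);
```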

The other data source we use is Influx, and a separate method went into wrapping those queries, which had a less heavily typed setup than what we had with MongoDB and mongoose in TypeScript.

The function I used to wrap the influx queries:

// return type for an influx query is Promise<influx.IResults<{}>>
export async function tracedInfluxQuery(
    query: string,
    options?: influx.IQueryOptions,
    parentSpan?: opentracing.Span,
    spanName: string = 'influx-query'
): Promise<influx.IResults<{}>> {
    let traceSpan: opentracing.Span;
    if (parentSpan) {
        traceSpan = tracer.startSpan(spanName, {
            childOf: parentSpan,
            tags: {
                [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
                'db.type': 'influxdb',
                'db.statement': query
            }
        });
    } else {
        traceSpan = tracer.startSpan(spanName, {
            tags: {
                [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
                'db.type': 'influxdb',
                'db.statement': query
            }
        });
    }
    const client = influxClient();
    const result = await client.query(query, options);
    // mark the span complete once the query has returned
    traceSpan.finish();
    return result;
}

You can see this allowed me to even embed the specific query into the span. It’s a lot of meta-data baggage to add to a span, and I think it’ll be useful, but other than being cool to see, it hasn’t paid off yet. As we promote this experiment into active systems with real usage, the tradeoff of consuming more space and bandwidth against the value will hopefully become a bit more clear. I might be able to get just as much value from a shorter tag, or a more meaningful grouping of the span names, instead of the explicit query itself.

Much of the time consuming lifting we’re doing is with time series data stored in InfluxDB, so while I wanted to see the mongoDB included to provide the context, I expected more detail to be useful when sub-spans were doing something with influx – and in the end, that was a pretty good assumption.

There were a few places in the code where I manually created a subspan and wrapped some logic with it. We use async/await throughout our code to keep it easier to understand and reason about, so I was tempted to try to wrap the promise and make that a nice boundary for a span, but in the end I found that just a few strategic additions around heavier code in the routes or elsewhere did what I wanted, and was a bit simpler.

Bringing it all together and seeing results

In the end, I have nearly all functions that are called from the API endpoints accepting a span as an optional parameter, and all the functions were updated to create spans regardless of a parent, using the logic above. With the tests so actively working the code at various levels, just running our local tests generated a bunch of traces, and I found quite a bit of value in seeing the progress I was making by running the test suite and then viewing all the traces reported into Jaeger on my desktop.

Some of the heavier lifting in the code is post-processing data we get back from influx – some of which can be quite extensive. When we wrapped some of the “what appears to be simple functional code”, it quickly highlighted how expensive getting that data (with our chosen way or algorithm) was – and it highlighted a few O(n^2) methods that were a lot more obvious in hindsight than in reading the code.

I also found that it was relatively easy to forget to close off a span, and when you do that, what happens is that you simply don’t see that span reported in Jaeger. So I quickly found myself focusing on code to make sure the sub-spans I expected to be there showed up as a means of knowing if I’d wrapped the logic correctly.
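A small bookkeeping tracer can make that check explicit in a test. The sketch below is illustrative only and is not part of our test suite: TrackingTracer and TrackedSpan are hypothetical classes that record every span started and can list the ones that were never finished, which are exactly the ones that would silently fail to show up in Jaeger.

```typescript
// Hypothetical span that remembers whether finish() was called.
class TrackedSpan {
    name: string;
    finished = false;
    constructor(name: string) {
        this.name = name;
    }
    finish(): void {
        this.finished = true;
    }
}

// Hypothetical tracer that keeps every span it hands out.
class TrackingTracer {
    spans: TrackedSpan[] = [];
    startSpan(name: string): TrackedSpan {
        const span = new TrackedSpan(name);
        this.spans.push(span);
        return span;
    }
    // Names of spans that were started but never finished.
    unfinished(): string[] {
        return this.spans.filter((span) => !span.finished).map((span) => span.name);
    }
}

const tracker = new TrackingTracer();
const routeSpan = tracker.startSpan('listThings');
const querySpan = tracker.startSpan('Site.findOne');
tracker.startSpan('influx-query'); // never finished: the mistake we want to catch
querySpan.finish();
routeSpan.finish();

const leakedSpans = tracker.unfinished();
```

A test can then assert that the list of unfinished spans is empty after exercising a route, instead of eyeballing the Jaeger UI for missing sub-spans.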

In the end, getting to this point ended up consuming probably 16 to 20 hours of time, and was mostly repetitious work that you could easily see progress on while you were doing it. And even at this point, with just the local test suite driving effort, there was a clear and valuable pay-off that starts to highlight “why some things are taking a while”.

Deploying to staging

The next step was to see it operating with some closer-than-test-suite real data. The infrastructure for a storage system for traces was more than I wanted to tackle to start getting data back. Jaeger does a good job of setting it up, but there are a lot of moving parts for a “downtime holiday project”.

We run all our code in a kubernetes cluster, and since I had an ephemeral memory-based container already available, I made a service of it with Kubernetes. JaegerTracing has some Kubernetes templates for those inclined to explore, and they helped me understand what will be needed in the future for a more solid, use-it-constantly implementation of a service.

I ended up forking what they had available and using it to make my own “ephemeral jaeger” using just in-memory storage.

# Copyright 2017-2018 The Jaeger Authors
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
apiVersion: v1
kind: List
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    name: jaeger-deployment
    labels:
      app: jaeger
      jaeger-infra: jaeger-deployment
  spec:
    replicas: 1
    strategy:
      type: Recreate
    template:
      metadata:
        labels:
          app: jaeger
          jaeger-infra: jaeger-pod
        annotations:
          # annotation keys as in the upstream Jaeger all-in-one template
          prometheus.io/scrape: "true"
          prometheus.io/port: "16686"
      spec:
        containers:
        - env:
          # env name as in the upstream Jaeger all-in-one template
          - name: COLLECTOR_ZIPKIN_HTTP_PORT
            value: "9411"
          image: jaegertracing/all-in-one
          # all-in-one Dockerfile: <>
          # CMD ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json"]
          args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]
          # all-in-one image is an in-memory image, with no default limit on how much memory will be consumed capturing
          # traces, so this puts a little bit of a limit on it.
          name: jaeger
          ports:
          - containerPort: 5775
            protocol: UDP
          - containerPort: 6831
            protocol: UDP
          - containerPort: 6832
            protocol: UDP
          - containerPort: 5778
            protocol: TCP
          - containerPort: 16686
            protocol: TCP
          - containerPort: 9411
            protocol: TCP
          readinessProbe:
            httpGet:
              path: "/"
              port: 14269
            initialDelaySeconds: 5
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-query
    labels:
      app: jaeger
      jaeger-infra: jaeger-service
  spec:
    ports:
    - name: query-http
      port: 80
      protocol: TCP
      targetPort: 16686
    selector:
      jaeger-infra: jaeger-pod
    type: NodePort
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-collector
    labels:
      app: jaeger
      jaeger-infra: collector-service
  spec:
    ports:
    - name: jaeger-collector-tchannel
      port: 14267
      protocol: TCP
      targetPort: 14267
    - name: jaeger-collector-http
      port: 14268
      protocol: TCP
      targetPort: 14268
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    selector:
      jaeger-infra: jaeger-pod
    type: ClusterIP
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-agent
    labels:
      app: jaeger
      jaeger-infra: agent-service
  spec:
    ports:
    - name: agent-zipkin-thrift
      port: 5775
      protocol: UDP
      targetPort: 5775
    - name: agent-compact
      port: 6831
      protocol: UDP
      targetPort: 6831
    - name: agent-binary
      port: 6832
      protocol: UDP
      targetPort: 6832
    - name: agent-configs
      port: 5778
      protocol: TCP
      targetPort: 5778
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod
- apiVersion: v1
  kind: Service
  metadata:
    name: zipkin
    labels:
      app: jaeger
      jaeger-infra: zipkin-service
  spec:
    ports:
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod

Most of this detail is stock from their examples, and offers a bunch of functionality that I’m not even touching (alternate transports and offering drop-in replacement functionality to accept Zipkin traces).

Buried in this detail is an option that I found on the all-in-one container to limit the number of traces that would be stored in memory, and since I wasn’t putting many other limits on this system, I wanted to put in at least some constraining factor. You can find it in args in the pod spec:

args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]

20,000 traces may end up being way too small, but for an ephemeral service implementation, I thought it made a reasonable starting point.

The other trick needed to make this all work in Kubernetes was adding something that runs in the same Pod as the API code, listens for the UDP data being sent, and forwards it into the ephemeral service I just set up. This part I had in notes from my work on Kubernetes for Developers, so I picked it back up and used it again almost verbatim. The end result is specifying two containers in the template that we used for the API code (this pattern is called a “side car”), so the additions end up looking like:

{{- if .Values.tracing.enabled }}
- name: jaeger-agent
  image: "{{ .Values.tracing.image }}:{{ .Values.tracing.tag }}"
  ports:
  - containerPort: 5775
    protocol: UDP
  - containerPort: 5778
  - containerPort: 6831
    protocol: UDP
  - containerPort: 6832
    protocol: UDP
  command:
  - "/go/bin/agent-linux"
  - ""
{{- end }}

We use Helm for the deployment into a cluster, and this is a snippet from the deployment template in our helm chart. The default values we’re using for the agent are:

  • image: “jaegertracing/jaeger-agent”
  • tag: “1.8”

With this in place, we deployed the tracing basics into our staging environment where we have more realistic integrations and data flowing, the end result of which is a tremendously more valuable solution. We access the Jaeger console with a port-forward right now, and this provided us with a way to quickly open a window into what was taking time behind the API calls and start to get to some really good conversations with this data as a base.

Where next?

This whole holiday experiment has been a great success so far, but there’s more work to do to use it seriously. Rolling the ephemeral tracing infrastructure into our production environment and switching from a constant (trace everything) sampler to a probabilistic one is likely one of the first tasks.
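As a rough sketch of what that switch might look like: the jaeger-client configuration object takes a sampler section with a type and a param, where 'const' with 1 traces everything and 'probabilistic' with a value between 0 and 1 traces that fraction of requests. Everything else below is an assumption for illustration: the tracingConfig function, the 'my-api' service name, the 10% rate, and keying the choice off an environment name are all made up, and the interfaces only mirror the parts of the configuration used here.

```typescript
// Mirrors only the parts of the jaeger-client config object used in this sketch.
interface SamplerConfig {
    type: 'const' | 'probabilistic';
    param: number;
}
interface TracingConfig {
    serviceName: string;
    sampler: SamplerConfig;
}

// Hypothetical helper: trace everything outside production, and sample
// a fraction of requests (10% here, an arbitrary choice) in production.
function tracingConfig(environment: string): TracingConfig {
    const sampler: SamplerConfig =
        environment === 'production'
            ? { type: 'probabilistic', param: 0.1 }
            : { type: 'const', param: 1 };
    return { serviceName: 'my-api', sampler: sampler };
}

const productionConfig = tracingConfig('production');
const testConfig = tracingConfig('test');
```

The resulting object is what would be handed to the jaeger-client initTracer call when the service starts.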

The API in question is the back-end not directly for our customers, but for an Angular web application in a browser. How we’ve structured those calls from Angular, when we’re doing them, and how much we try to do in each call is where I expect to get some immediate value, as well as being able to answer the questions behind “why is it taking so long for this to render?”

I have this crazy idea about viewing portions of the single page application (what gets displayed) as a sort of costed effort. Now that we have “time taken” and the visibility of where it goes, we can look at what we load, when we load it, and “how much it costs” – from a time and user responsiveness point of view. So maybe we could annotate a wireframe with “how expensive” each data visualization piece is, even if it’s done manually on a whiteboard, using Jaeger data about the “time-expense” of the API calls needed as the source for that information.

It’s also quite possible that I’m a little too “budget concept” happy, and the majority of value will be in developers, designers, and product managers in our company just being able to have a concrete way to see and address the desire to make really outstanding web-based user interfaces.

When we have it loaded into our production environment, then getting familiar with Jaeger filtering and that UI may also support us identifying problems quite a bit more effectively, and I’m actively looking forward to seeing this bolster the data we’re already capturing with our production monitoring with Prometheus.

And finally, I’m also interested to see where Grafana 6.0 goes in the future. It provides such lovely overlays for metrics capture, is building now into logging, and will perhaps link that to tracing in the future.

I’d love to be able to add what’s happening in the Angular application and cascade the tracing from there. I don’t know of a way (outside of a lot more service infrastructure, and getting a LOT more familiar with jaeger-client and making a variant to work from a browser) to capture and forward traces. I’m honestly not sure I’d always want it in place, whereas I do want it always there and on at the API layer, at least while it’s not overly negatively impacting the service calls. There might be something interesting we can do for a single browser with a special setup that would allow us to profile from the user interface all the way back to the services, so there’s another interesting thought experiment for another down-time.

Underneath your code

Underneath your code is the working title for a new writing project. I started coming up with the idea for this a couple of months ago based on what I’ve been learning and hearing. Some of this comes as a follow-up from Kubernetes for Developers, and some of this is just aiming to help new developers.

A lesson that has been reinforced for me recently is that I can too easily assume a breadth of knowledge that doesn’t exist. I have been working with a couple of new developers recently, and there have been a number of concepts they weren’t yet familiar with that caught me by surprise. As I went looking for public resources to help them learn and dig into the topics they were interested in learning more deeply, what I found was a lack of cohesive and consistent information. All the detail is out there, even freely available, but scattered in a lot of different places. It is hard to pull together, and even harder to make the connections of why some of these details are interesting or relevant.

I’ve recently also seen some pushback when it comes to Kubernetes and Developers. The general meme as I’m perceiving it is “don’t expose Kubernetes to developers, they don’t need to know or it’s just too confusing”. Ahmet has great points about it being confusing or perhaps not being the best or correct abstraction over operational needs, but I think backing away from making Kubernetes more accessible and useful to developers is the worst possible path. Outside of my own opinion that using Kubernetes is a good and reasonably efficient way to help keep software running reliably and consistently,  I think developers should absolutely know the world in which their code lives, and the details of how that runs, or not. Lots of general operational concepts are just unknown, confusing, or perceived as irrelevant or that should be hidden. I think it’s a lot better to be able to have a holistic view and knowingly choose what you hide and abstract away, and even more so what constraints come with those choices.

What I am starting is the next layer down from what I wanted to create with Kubernetes for Developers. Instead of writing about an operational tool for developers, I am aiming at writing about the underlying concepts, systems, and common mechanisms. My goal is to make that information accessible and understandable for new developers. My first (terrible) working title was “Ops for Devs” as a lousy inverse of “DevOps”.

I thought about writing a book in the classic fashion: editor, publisher, etc – but only for a very brief period. I want to take this in a different direction. I want to publish and promote this work to a broad set of new developers. To enable that goal I want to make the content solidly open and freely accessible. At the same time, I also want to get professional editing help (they are such a godsend for writers), so I am aiming to collect all the pieces together as a PDF, ePub, and mobi and make it available for sale as a means of funding an editor, and perhaps some of the other niceties such as graphics and illustration.

The work from Julia Evans and her amazing and accessible Zines has been an inspiration. The comic style isn’t what I feel comfortable creating, but I’m constantly referring people to specific zines as a good intro/starting point for a variety of topics, and as a fun seed that can lead them to digging deeper and more broadly.

Tooling is a bit of an open question. I have written in word processors (Scrivener, MS Word, Google Docs), used markup (reStructuredText, Markdown), and used a variety of publishing/rendering tools (Sphinx, ReadTheDocs, and more recently Hugo). Something that can be stored in plain text and usefully read from there is important to me, so I’ll probably ditch the proprietary word processors and see what I can do with content sourced in GitHub, GitLab, or Bitbucket.

Just recently I started looking at GitBook and LeanPub. I heard some great things about LeanPub recently, but my own experiments haven’t been very successful.

My lack of success with LeanPub may come from their service requiring that 2-factor auth not be enabled for GitHub, which just scuttles me, as the Kubernetes organization very reasonably requires 2-factor auth to be enabled. It’s not entirely clear whether this is the problem though, because when I render sample content I get just a vague “something broke, go check what you changed” message with no details about the failure for me to diagnose.

With the what and where still up in the air, I’ll probably take a little time to look at AsciiDoc and AsciiDoctor for the generation. To get started, I fell back to a tool I’ve loved in the past: Scrivener, and took an initial stab at doing the markup for it in Sphinx and hosting it on ReadTheDocs.

However I end up playing it, an editor (or editors) and technical reviewers will definitely fit in. I was disappointed with my “editor experience” using the last publisher I worked with, but I have met so many great editors out there that I think it will be quite possible to find one (or a couple) to work with and pay them directly. Even though I had to back away due to time constraints a few months ago, working closely with the Kubernetes Docs team was a great experience, and really cemented the idea that I could find editorial help outside of a larger publisher.

Having done technical writing on and off for a couple of decades now, I’m very familiar (and slowly getting comfortable) with how fundamentally ephemeral it really is. The content from Kubernetes for Developers may have an 18+ month lifespan in terms of being useful, and when I started the project I expected that to be as short as 12 months for a useful lifetime. Looking at the work six months after publication, and reviewing how the Kubernetes project has evolved – it looks like I managed to nail enough core content that at least half of it will be viable for quite a bit longer.

One of my struggles while writing was the amount of time and effort it took to do good writing for the relatively limited lifespan of the content. Technology changes fast; products and platforms develop and evolve in weeks and months. They change, grow, and yeah – die off as well. I view it as an interesting challenge, and my current thinking is to work on shorter, more directed topics rather than larger, more expansive, volumes of work. I also want to keep a published reference material clear on its biases – what’s experimental and strongly opinionated I prefer to keep or see in a blog rather than what I think of as a cohesive publication.

The space for more exploratory, short, and opinionated efforts is still critical. Where the options are evolving quickly (for example, service meshes and ingress options in Kubernetes) I think it may be better to get at least some “how to” information available, over making it fully integrated and cohesive.

There is no denying the success and value of sites like StackOverflow, a site I’m simultaneously grateful for and annoyed by. The content on the site is a beautiful example of caveat emptor and the need for critical thinking in reviewing and accepting the answers. It is also an amazing communal resource for information, answers, how-to, and examples. By its very nature it lacks a cohesive voice, style, or guide to what is available – and it is a great resource as it stands. What it does well, and what I find lacking from it, are influencing what I want to write, as well as how I might organize it.

The loose outline that I’m starting with for Underneath Your Code:

  • Where your code runs – what a “process” is
    • Running code in a browser vs. running directly on an OS
  • Operating system basics
    • Processes, file systems, memory, and networks
    • A little deeper on networks: DNS, IP addresses, ports & sockets
    • IP v4 and IP v6, TCP & UDP, DHCP, ZeroConf/Avahi
    • WTF is a 7 layer OSI model anyway, and why do I care?
  • Physical & virtual devices and IO
    • Serial ports, block and character devices, what is POSIX
    • USB and Bluetooth
    • Specialized hardware (GPUs, Accelerators, etc)
  • Practicals for working with processes
    • shell scripts and some of Unix CLI concepts
    • commands, pipes, STDOUT, STDERR, STDIN
    • environment variables, and shell tests
  • Some basics about how operating systems work
    • Init, systems, hierarchy of processes and how they coordinate
    • Kernel vs. user space and permissions, and privileges
    • Memory, Buffers, and the various dials that can be tuned (queue theory)
    • Sandboxing, background tasks, and scheduling
  • Another layer of indirection
    • Containers and VMs
    • Shared vs emulated resources
    • Cloud resources to IoT: you have a budget: CPU, Mem, IO
    • Smaller and smaller bits of computation – microprocessors and embedded devices
  • Physical stuff breaks, all the time
    • Redundancy and consistency
    • Networks, latency, and information theory
    • Why and how these fundamentals expose constraints for development work

Almost all of the topics in the outline could be (and in many cases are) books in their own right. While aiming to make an overview accessible and understandable, I am not covering every corner and case. This may be more sensibly organized as several works. I’m sure I’ll re-organize it several times when I get into the writing and trying to keep a more-or-less consistent narrative.

If you have feedback or thoughts on what would be useful, I’m all ears. You can poke me on social media or leave a comment here.


Review of using Helm to package and host applications

The open source project Helm represents itself as a package manager for Kubernetes. It does that pretty darn well, having been attached to the project from the earliest days of Kubernetes, and continues to evolve alongside, and now a bit more separately from, the Kubernetes project.

Looking at Helm version 2, it provides three main values:

  • As a project, it coordinates and reinforces a number of “best practices” in how to do that generation by housing a public collection of some of the most common/popular open source projects, packaged and “ready to use” within a Kubernetes cluster.
  • It provides a templating solution to generate the plethora of YAML files that make up the descriptions and definitions of Kubernetes resources.
  • It provides a “single command” tool for deploying one or more projects to an existing Kubernetes cluster.

The first two values have been the most meaningful to me, although with some definite caveats and gotchas. The third I thought would be more valuable. I still use the single-deploy-command regularly, although I’m questioning if it is a crutch that will ultimately be a trouble point.

More in depth for each of these:

Package Repository

The single most powerful aspect of Helm isn’t the code itself, but the result of the community and the contributions those community members have wrought. The charts collection is the de facto set of examples of how to organize, structure the inputs, and run software with the resource concepts of Kubernetes.

While it calls itself a package manager, there’s a gap if what you are expecting from it is a single binary package that is installable – the moral equivalent of an .rpm, .deb, .apk, or installer exe file. You don’t get a single file – instead you get a set of configuration values alongside a set of templates. The configuration values are the defaults used with the templates to generate the description of the Kubernetes resources. These default values can also be overridden when you invoke helm, which is a godsend for scripted deployment (such as CI/CD). The resource descriptions generated from the templates expect (and require) the actual software you’re running to be available via one of the public container registries, with quay and DockerHub among the most commonly referenced. The software you’ll actually be running – the container images – is (intentionally) not included within the helm chart.
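To make the defaults-plus-overrides idea concrete, here is a small conceptual sketch of layering overrides on top of a chart’s default values. It is not Helm’s implementation, just an illustration of the effect: mergeValues and the Values type are hypothetical names, and the tracing values are borrowed from my own chart as sample data. Passing something like --set tracing.enabled=true on the helm command line has roughly this effect on the values the templates see.

```typescript
// Nested configuration values, loosely modelled on a chart's values file.
type Values = { [key: string]: Values | string | number | boolean };

function isMap(value: unknown): value is Values {
    return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Overrides win over defaults; nested maps are merged key by key
// so an override only needs to mention what it changes.
function mergeValues(defaults: Values, overrides: Values): Values {
    const merged: Values = { ...defaults };
    for (const key of Object.keys(overrides)) {
        const base = merged[key];
        const over = overrides[key];
        merged[key] = isMap(base) && isMap(over) ? mergeValues(base, over) : over;
    }
    return merged;
}

const chartDefaults: Values = {
    tracing: { enabled: false, image: 'jaegertracing/jaeger-agent', tag: '1.8' },
};
const cliOverrides: Values = { tracing: { enabled: true } };

const effective = mergeValues(chartDefaults, cliOverrides);
const effectiveTracing = effective['tracing'] as Values;
```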

If you want to run your Kubernetes cluster where it can’t take advantage of the public internet and those resources, be aware and prepared. I would not be surprised to see caching proxies of those services develop, much like Artifactory and Nexus developed for the Maven build tooling. In fact, I would keep a close eye on Harbor (technically open source, but dominated by VMware) to see what might develop to help deploy in more isolated scenarios. It is not all that difficult to use private and local container repositories, just be aware the public packages expect to use the public container repositories.

Pervasively embedded within the templates is a fairly robust and opinionated set of views on how to take advantage of Kubernetes. The content of the templates contains a ton of knowledge, but be aware it is not always consistently applied or exposed. Like many projects it has learned from experience, so newer and more recently updated charts will often reflect different views of what is important and useful for deployment. Nonetheless, they provide a huge number of functional and effective patterns for running software.

These patterns are the strongest where the features have existed and been stable within Kubernetes for a while – Pods, ReplicaSets, and the effective use of the side car pattern for bolting on ancillary functionality or concerns. They are weaker (or perhaps viewed differently: various levels and consistencies of workarounds were created) for some of the newer features in Kubernetes, including StatefulSets and even Deployments.

In some cases, early workarounds were so heavily embedded that they persisted long after the need existed: the concept of Helm “deploying and managing state” was a filler for the gap of Kubernetes not having a solid Deployment concept earlier, and the whole world of custom resources and extending Kubernetes with operators overlaps with what Helm enabled with hooks. My perception is that both Kubernetes and Helm charts are struggling with how to best deploy these newer structures, which themselves often represent some operational knowledge or intended patterns.

Like its virtual-machine-focused brethren (Chef and Puppet) before it, Helm added testing and validation for its charts. The chart validation has expanded significantly in the past 18 months. Like any validation, it does not guarantee 100% effectiveness. Even still, I believe it is important to be willing to review what the chart is doing, and how it’s doing it, before using it. The instances of charts failing with newer versions of Kubernetes have decreased significantly, primarily due to the focus of the Helm community on recognizing this as a problem, exposing it, and resolving it when it occurs.


A bit of background – Kubernetes resources can be defined in either JSON or YAML, and are a declarative structure: a desired state of what you want to exist. These structures are loosely coupled, “tied together” with the concept of labels and selectors. This is both a blessing and a curse, providing a lot of flexibility, but if you typo or mismatch some of those connections, there can be very little – to no – checking and it can be quite difficult to debug.
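
To show why a mismatched label is so hard to spot, here is a small Python sketch of equality-based selector matching. It is a simplified model rather than Kubernetes code; the pod names, labels, and helper functions are made up for illustration.

```python
def matches(selector, labels):
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())

# Hypothetical pods and the labels attached to them.
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1", "labels": {"app": "db"}},
]

def select(selector):
    """Names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(selector, pod["labels"])]

good = select({"app": "web"})   # the intended selector
typo = select({"app": "wbe"})   # a one-character typo

print(good)
print(typo)
```

The typo is not an error in this model: the selector is perfectly valid and simply matches nothing, which is the kind of silent failure that makes these connections difficult to debug.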

In creating these resource manifests, you will often find yourself repeating the same information, sometimes many times – or explicitly using repetition to tie pieces together. That repetition and boilerplate overhead was ripe for the solution that developed: templating.

Helm uses (and exposes) Go templates together with the Sprig function library, to greater and lesser degrees. Having used the templating language, my opinion is that it is no better (or worse) than any other templating system. It has many of the same concepts that you might find in other templating systems, so if you are already familiar with a templating language, picking up the one used by Helm may be awkward but really is not too bad.
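
To make the repetition point concrete, here is a sketch that uses Python’s standard string.Template as a stand-in. Helm’s charts use Go template syntax, not this one, and the Service below, with its name and port, is purely hypothetical; the point is only that a single value gets stamped into every place that has to agree.

```python
from string import Template

# One name is repeated three times: the resource name, its label,
# and the selector that ties the Service to its pods.
MANIFEST = Template("""\
apiVersion: v1
kind: Service
metadata:
  name: $name
  labels:
    app: $name
spec:
  selector:
    app: $name
  ports:
    - port: $port
""")

rendered = MANIFEST.substitute(name="web", port=8080)
print(rendered)
```

Changing the name in one place now updates the label and the selector together, which removes a whole class of mismatches.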

There are variants in other projects that enable a similar functionality to Helm (KSonnet, and the now mostly ice-boxed Fabric8). Even with competition, the network effects from Helm’s collection of charts make it very hard to compete. Most solutions in this space have to make a choice of how much of a language to build vs. how simple the templates are to use – a continuum between a fully fledged programming language and simple, targeted replacement of values. Helm’s choice adds in some language structures (concepts of making and using a variable, and transforming values), but holds back from the slippery slope into a fully custom language.

We will see if that holds true with Helm version 3’s development, which will be adding the language Lua into the mix, although it appears more for handling the deployment hooks aspect.

If you are a NodeJS, Ruby, or Python developer and looking at the charts, you may have more confusion around what the resource you’re trying to create should look like rather than any trouble with the templating language itself. The templating does nothing to encapsulate or simplify Kubernetes resources and how they may be used together (or not). Helm itself has two commands that have been lifesavers in learning the templating and using charts:

helm template


helm install --debug --dry-run

These two commands run the templating engine and dump (with slight variances in what they’re expecting) the rendered results. You still end up seeing (and having to deal with) the “wall of YAML” that is Kubernetes, but these two commands at least make it easy to see the results of the templates after they render.


The exciting (yes, I get excited about weird things) capability of Helm to deploy my applications in a single command reinforced the concept of it being a package manager, but may ultimately be the biggest crutch of the solution.

As mentioned earlier, Helm “grew up” with Kubernetes, and was alongside the project from its earliest days, covering the gaps from the core of Kubernetes to the cold hard reality of “getting your stuff running” in a cluster. Helm’s concept of Tiller may be the earliest seed of an operator, as it installs itself into your cluster and then handles the deployment and management of its resources. This same capability has more recently been codified into custom resources and the operator pattern, with the simplest and most common use cases covered by the Deployment resource and the associated controller.

When Kubernetes finally included RBAC as a default, Helm (and how Tiller was installed) illuminated a bit of a hole in how many people were using and deploying software. There was a lot of work exposing and thinking about how to properly secure Helm. Helm 3 will be removing Tiller from the concept of Helm, continuing to evolve with Kubernetes features.

You also don’t strictly need to use this capability of Helm, although it is darned alluring. As mentioned in the section on templating, you can render charts (and their templates) locally and use tools such as kubectl to apply the resulting resources to a cluster.
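
A minimal sketch of that render-locally-then-apply flow, written in Python so it can sit in a deployment script. The chart path and the override are hypothetical, and the helper names are mine; it assumes that helm and kubectl are on the PATH and that kubectl is already pointed at the right cluster. The example only builds and prints the commands; calling render_and_apply would actually run them.

```python
import subprocess

def render_and_apply_commands(chart_path, overrides=None):
    """Build the argv lists for: helm template CHART | kubectl apply -f -"""
    helm_cmd = ["helm", "template", chart_path]
    for key, value in sorted((overrides or {}).items()):
        helm_cmd += ["--set", f"{key}={value}"]
    kubectl_cmd = ["kubectl", "apply", "-f", "-"]
    return helm_cmd, kubectl_cmd

def render_and_apply(chart_path, overrides=None):
    """Render the chart locally, then hand the YAML to kubectl on stdin."""
    helm_cmd, kubectl_cmd = render_and_apply_commands(chart_path, overrides)
    rendered = subprocess.run(helm_cmd, check=True, capture_output=True).stdout
    subprocess.run(kubectl_cmd, input=rendered, check=True)

# Hypothetical chart directory and value override.
helm_cmd, kubectl_cmd = render_and_apply_commands("./mychart", {"image.tag": "1.0.1"})
print(" ".join(helm_cmd), "|", " ".join(kubectl_cmd))
```

Going this route means Helm never records a release, so you give up its upgrade and rollback bookkeeping in exchange for plain resources that other tools can manage.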

Having a single command that is easy to add into a script has been a godsend for continuous deployment scenarios. It is what powers GitLab’s “AutoDevOps” hosted continuous deployment. I use the deploy-with-a-single-command myself, and plan to continue to do so, but it comes with a price.

Helm likes to own the whole lifecycle of the release and does not expect or accommodate anything else interacting with the stuff it is managing. In many cases, this is completely fine, but in some cases it can be a pain in the butt. The most common scenario usually involves some manner of persistence – where you want to install and get the software running, and need it operational to do further configuration on how you want to use it. This could be anything from linking multiple instances of some software into a cluster, doing a database schema migration, or doing a restore of prior backups.

Helm 2 has the concept of hooks to help with actions that happen repeatedly and consistently with every deployment or upgrade process. Helm 3 will be expanding on these concepts, although I don’t yet know the specifics, with Lua as the scripting language to enable this functionality and potentially more.

I am personally conflicted on the inclusion of Lua and what it implies for Helm. While Lua is a lovely scripting language, and likely the best choice for what the developers decided they wanted to do, I think it may end up being a new barrier to adoption for developers outside of the Helm charts/Kubernetes space. Every developer that sits down to use Kubernetes comes with their own biases and comfort with scripting languages. They are often used to Python, Ruby, JavaScript, or any of a number of other languages. If Lua becomes an implicit requirement for them to use Helm to accommodate their own operational needs, I suspect that barrier will be significant. Because of this I am hesitant to be excited about the inclusion of, and focus on, Lua with Helm. What it will ultimately mean in terms of developer accessibility to using Helm and Kubernetes together is yet to be seen. I hope it won’t be an even larger and steeper learning curve.

For the scenario where you want to do periodic, but not consistent, interactions – such as backing up a database or doing a partial restore or recovery – you need to be very aware of the application and its components in their lifecycles. In these scenarios, I have not found a terrific way of using Helm and its hooks to help solve those problems.

Kubernetes itself is only partially effective in this space, having the concept of jobs for one-off batch style mechanics. However, jobs can be darned awkward to use for things like a backup. While I used jobs and continue to try and make them work, I often revert to using kubectl to directly interact with temporary pods to get the work done consistently.

With Helm, I struggled with creating job resources that utilize the same ConfigMaps, Secrets, and variables that are used with the charts. Helm is crappy at doing this if you’re using the deploy-with-a-single-command mechanism. And as I mentioned earlier, Jobs can be an awkward fit with the use cases I am trying to accommodate. These scenarios are more “pets” and “one-off” needs where knowledge of the underlying systems and their current state is critical. It may be that operators will ultimately win out for these use cases, but they have a fair way to go yet.
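
For a sense of what such a one-off Job has to carry, here is a Python sketch that builds a Job manifest reusing an existing ConfigMap and Secret as environment variables. Every name in it (the release, image, command, ConfigMap, and Secret) is hypothetical. The awkward part in practice is that the real names are generated by the chart’s templates, so you have to look them up yourself before you can write this.

```python
import json

def backup_job(name, image, command, config_map, secret):
    """Build a batch/v1 Job whose container reads its environment from an
    existing ConfigMap and Secret."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "envFrom": [
                            {"configMapRef": {"name": config_map}},
                            {"secretRef": {"name": secret}},
                        ],
                    }],
                }
            }
        },
    }

# Hypothetical names; in a real cluster these come from the deployed release.
job = backup_job(
    name="db-backup",
    image="example/db-tools:1.0.0",
    command=["/bin/sh", "-c", "run-backup"],
    config_map="myrelease-db-config",
    secret="myrelease-db-secret",
)
print(json.dumps(job, indent=2))
```

Since Kubernetes accepts JSON as well as YAML, the printed manifest can be fed to kubectl directly, outside of anything Helm is tracking.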

At its heart, this deployment capability that I implicitly use many times a day also strikes me as the current edition of Helm’s weakest point, and I wonder if it is a crutch that I will ultimately need to replace.