Ship tests with the code

It has been conventional wisdom for years that you don’t want to ship your tests with your code. When I was doing quite a lot of build engineering, creating whatever packaging we used always included a step that separated the test code from the production code. Maybe that meant excluding some Java classes when packing a jar, or just splitting off a subdirectory in Python, Ruby, or JavaScript so that the packaged library didn’t get “bloated” with the tests.

I think we should reconsider that stance…

I’m not sure what changed my mind about this, or that I have a new concept fully formed. Some of this extends the concept of healthz that Kelsey Hightower talked about at Monitorama in 2016. Some of it is continued thinking on fully autonomic systems, which I’ve written about on this blog over the past year.

My own experience building and integrating tests into CI and CD pipelines in a number of languages reinforces that viewpoint. I see clear benefits for distributed development teams using TDD, or using tests as service contract validation. I first used this sort of concept some 17 years ago at the startup Singingfish, validating service contracts for a pipeline that built data for a streaming-media search engine. I’ve re-used “tests as validation” in nearly every position since then.

There is a huge overlap in value between the tests written during development and what you need when debugging and isolating a problem with your code while it is running in production. Ironically, it is usually when tests break that you realize your own testing setup may not make it easy to debug and isolate the relevant problem – which highlights a good time to refactor the tests! (For the “test pedants” out there: I’m speaking about functional or integration tests, not unit tests.) Unit tests are often too granular to be useful for the idea I’m describing. The pieces that change as your code runs, such as service status and persistent backends, are better covered by slightly higher-level tests.
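One way to get that overlap is to write each functional check as a plain function that returns a result instead of raising. The same code can then be asserted on by a test runner and also called by a diagnostics endpoint in production. This is a minimal sketch, and everything specific in it is a hypothetical assumption: the `/search` endpoint, the `results` field, and the `CheckResult` type are illustrative and not part of any particular service.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def check_search_returns_results(base_url: str, timeout: float = 5.0) -> CheckResult:
    """Functional check against a hypothetical /search endpoint.

    Returns a CheckResult instead of raising, so a test runner can assert on
    it and a production diagnostics endpoint can report it.
    """
    name = "search_returns_results"
    try:
        with urllib.request.urlopen(f"{base_url}/search?q=test", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and timeouts are OSError subclasses;
        # malformed JSON raises json.JSONDecodeError, a ValueError.
        return CheckResult(name, False, f"request failed: {exc!r}")
    # Hypothetical contract: the body is a JSON object with a "results" list.
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        return CheckResult(name, False, "response has no 'results' list")
    return CheckResult(name, True)
```

In a test suite, a one-line `assert check_search_returns_results(url).passed` wraps it. In production, the same function can feed a diagnostics endpoint and report `detail` rather than failing a build.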

The challenge is that there isn’t a consistent way to package and deploy tests that work against a running system. There are also complications, because sometimes what you want to test and validate will change the state of the system. Or perhaps better stated: every language, and nearly every deployment methodology, has its own method – and there’s a very diverse set of conventions with unfortunately little overlap.
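For checks that must write to a backend, one mitigation is to confine the writes to uniquely named scratch data and always clean it up afterwards. This is a minimal sketch that assumes a dict-like store. `scratch_key`, `check_store_roundtrip`, and the `diag-` prefix are hypothetical names, and a real backend such as a database or cache client would need its own equivalent of `pop`.

```python
import uuid
from contextlib import contextmanager
from typing import Iterator, MutableMapping


@contextmanager
def scratch_key(store: MutableMapping[str, str], prefix: str = "diag-") -> Iterator[str]:
    """Yield a unique key for a diagnostic to write to, and always remove it afterwards."""
    key = f"{prefix}{uuid.uuid4().hex}"  # unique, so it never collides with real data
    try:
        yield key
    finally:
        store.pop(key, None)  # clean up even if the check raised


def check_store_roundtrip(store: MutableMapping[str, str]) -> bool:
    """Write-then-read check that leaves no residue in the store."""
    with scratch_key(store) as key:
        store[key] = "ping"
        return store.get(key) == "ping"
```

This does not make an intrusive check harmless: load and latency still change. It only keeps the check from leaving its own data behind.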

Container deployments are shifting the granularity of deployment to a level where tests relevant to a single container, acting as a diagnostics and validation system, seem not only feasible but are getting to be a damned good idea. The healthz endpoints, health checks, and liveness probes that container orchestration systems include are already some of these tests. If you’re deploying with Kubernetes, Nomad, or some of the newer tooling, you may have already built some of these API endpoints into your code.
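As a rough illustration, a healthz endpoint can be built with nothing but the Python standard library. The check names, the `/healthz` path, and port 8080 are assumptions made for the sketch. The only convention it relies on is answering 200 when healthy and a non-2xx status (here 503) when not.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical liveness checks: name -> zero-argument callable returning True when healthy.
CHECKS: Dict[str, Callable[[], bool]] = {
    "config_loaded": lambda: True,
}


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as a failure
    status = 200 if all(results.values()) else 503
    return status, {"ok": status == 200, "checks": results}


class HealthzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = evaluate(CHECKS)
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Blocking: serve /healthz on all interfaces (the port is an assumption)."""
    HTTPServer(("0.0.0.0", port), HealthzHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A Kubernetes `httpGet` liveness probe pointed at this path treats any status from 200 up to (but not including) 400 as success. The 503 returned when a check fails therefore registers as a probe failure.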

I could go all Star Trek and start suggesting “diagnostic levels”. Each test would carry a potential impact and requirements for running it, along with what running it means for the system. If you have Hystrix or some other circuit breaker in your distributed system, and/or your software is generally antifragile, you might be able to run these diagnostics fairly frequently with minimal notice or impact. Chaos Monkey (and related tools), which Netflix popularized for deliberately abusing distributed systems, is a variant of this kind of validation.
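A “diagnostic level” could be as simple as an ordered enum attached to each diagnostic, with the caller choosing the highest level it is willing to tolerate. The level names and the `Diagnostic`/`run_up_to` helpers are hypothetical. Nothing here is an established standard.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Level(IntEnum):
    """Hypothetical diagnostic levels, ordered by potential impact."""
    PASSIVE = 0     # read-only; safe to run constantly
    LOW_IMPACT = 1  # light reads/writes against dependencies
    INTRUSIVE = 2   # may change state or degrade service; run on demand


@dataclass
class Diagnostic:
    name: str
    level: Level
    run: Callable[[], bool]


def run_up_to(diagnostics: List[Diagnostic], max_level: Level) -> Dict[str, bool]:
    """Run only diagnostics whose level does not exceed max_level."""
    results: Dict[str, bool] = {}
    for diag in diagnostics:
        if diag.level > max_level:
            continue  # skipped: too intrusive for this invocation
        try:
            results[diag.name] = bool(diag.run())
        except Exception:
            results[diag.name] = False
    return results


if __name__ == "__main__":
    diagnostics = [
        Diagnostic("config_loaded", Level.PASSIVE, lambda: True),
        Diagnostic("cache_roundtrip", Level.LOW_IMPACT, lambda: True),
        Diagnostic("failover_drill", Level.INTRUSIVE, lambda: False),
    ]
    # A scheduler polling frequently might only allow up to LOW_IMPACT.
    print(run_up_to(diagnostics, Level.LOW_IMPACT))
    # -> {'config_loaded': True, 'cache_roundtrip': True}
```

Making the levels an `IntEnum` keeps “at or below this impact” a plain comparison. A system with good circuit breaking could raise its `max_level` over time.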

Perhaps there could be a consistent way to expose the ability to invoke these diagnostics: a controlled interface to run diagnostics as a core, functional part of the services we’re building, such as a diagnostics REST API or gRPC endpoint. A simple, read-only version could build on something like the Prometheus metrics format, so that multiple higher-order systems can read and interpret the information. More complex APIs could use POST or some RPC mechanism to invoke more intrusive levels of diagnostics when needed.
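The read-only version might render diagnostic results in the Prometheus text exposition format, as in this sketch. The metric name `service_diagnostic_passed` and the `diagnostic` label are assumptions. The `# HELP`/`# TYPE` lines, the label syntax, and the escaping of label values follow the text format itself.

```python
from typing import Dict


def to_prometheus(results: Dict[str, bool]) -> str:
    """Render diagnostic results as Prometheus text exposition format.

    The metric name service_diagnostic_passed is hypothetical.
    """
    lines = [
        "# HELP service_diagnostic_passed Whether a diagnostic passed (1) or failed (0).",
        "# TYPE service_diagnostic_passed gauge",
    ]
    for name, passed in sorted(results.items()):
        # Label values must escape backslash, double quote and newline.
        label = name.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'service_diagnostic_passed{{diagnostic="{label}"}} {int(passed)}')
    return "\n".join(lines) + "\n"  # the format expects a trailing newline
```

Each diagnostic becomes a 0/1 gauge sample such as `service_diagnostic_passed{diagnostic="db"} 1`, which any Prometheus-compatible scraper can read. Intrusive levels would still need a separate, authenticated POST or RPC path.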

I don’t know what form it could or should take. The goal would be something that could be invoked by Docker, Marathon, Kubernetes, Cloud Foundry, Heroku, or even the upcoming “serverless” frameworks like OpenWhisk or Fission. It would give the humans who build and run the service information about what’s functional and what’s not. We’re not even at healthz yet, but we could get there and beyond it: writing and shipping software that has the capability to validate its own functionality.


Published by heckj

Developer, author, and life-long student. Writes online at
