A bit over a year ago, I wrote Six rules for setting up continuous integration, which received a fair bit of attention. One of the items I called out was the speed of tests: I suggested keeping a test run to around 15-20 minutes to encourage and promote developer velocity.
A few weeks ago, the team at SemaphoreCI used the word “proper” in a similar blog post, and suggested that a timing marker should be 10 minutes. This maps pretty directly to a new feature they’re promoting to review the speed of their CI system as it applies to your code.
I disagree a bit with the 10 minute assertion, only in that it is too simplistic. Rather than a straight “10 minute” rule, let me suggest a more comprehensive solution:
Tier your tests
Not all tests are equal, and you don’t want to treat them as such. I have found 3 tiers to be extremely effective, assuming you’re working on a system that can have some deep complexity to review.
The first tier: “smoke test”. Lots of projects have this concept and it is a good one to optimize for. This is where the 10 minute rule that SemaphoreCI is promoting is a fine rule of thumb. You want to verify as many of the most common paths of functionality as you can, within a time-boxed boundary – you pick the time. This is the same tier that I generally encourage for pull request feedback testing, and I try to include all unit testing in this tier, as well as some functional testing and even integration tests. If these tests don’t pass, I’ve generally assumed that a pull request isn’t acceptable for merge – which makes this tier a quality gate.
I also recommend you have a consistent and clearly documented means of letting a developer run all of these tests themselves, without having to resort to your CI system. The CI system provides a needed “source of truth”, but it behooves you to expose all the detail of what this tier does, and how it does it, so that you don’t run into blocks where a developer can’t reproduce the issue locally to debug and resolve it.
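To make the time box concrete, here is a minimal sketch of how you might split a test suite into the first tier and everything else. The `TimedTest` record, its `priority` ranking, and the `split_tiers` helper are all hypothetical names I made up for illustration, and the sketch assumes tests run one after another and that you already have measured durations for them.

```python
from dataclasses import dataclass


@dataclass
class TimedTest:
    name: str
    seconds: float  # measured wall-clock duration of the test
    priority: int   # hypothetical ranking: lower means a more common path


def split_tiers(tests, budget_seconds):
    """Greedily fill the smoke tier up to the time box.

    Tests are considered in priority order (shortest first within a
    priority). Anything that does not fit the remaining budget is left
    for a later tier. Returns (smoke, later, seconds_used).
    """
    smoke, later = [], []
    used = 0.0
    for test in sorted(tests, key=lambda t: (t.priority, t.seconds)):
        if used + test.seconds <= budget_seconds:
            smoke.append(test)
            used += test.seconds
        else:
            later.append(test)
    return smoke, later, used
```

A greedy fill like this is not an optimal packing, but it keeps the most common paths in the gate and makes the trade-off visible whenever a new slow test pushes something out of the first tier.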
The second tier: “regression tests”. Once you acknowledge that timely feedback is critical and pick a marker, you may start to find some testing scenarios (especially in larger systems) that take longer to validate than the time you’ve allowed. Some you’ll include in the first tier, where you can fit things to the time box you’ve set – but the rest should live somewhere and get run at some point. These are often the corner cases, the regression tests, integration tests, upgrade tests, and so forth. In my experience, running these consistently (or even continuously) is valuable – so this is often the “nightly build & test” sequence. This is the tier that starts “after the fact” feedback, and as you’re building a CI system, you should consider how you want to handle it when something doesn’t pass these tests.
If you’re doing continuous deployment to a web service then I recommend you have this entire set “pass” prior to rolling out software from a staging environment to production. You can batch up commits that have been validated from the first tier, pull them all together, and then only promote to your live service once these have passed.
If you’re developing a service or library that someone else will install and use, then I recommend running these tests continuously on your master or release branch, and if any fail then consider what your process needs to accommodate: Do you want to freeze any additional commits until the tests are fixed? Do you want to revert the offending commits? Or do you open a bug that you consider a “release blocker” that has to be resolved before your next release?
An aside here on “flakes”. The reason I recommend running the second tier of tests on a continuous basis is to keep a close eye on an unfortunate reality of complex systems and testing: flakey tests. Flakes are invaluable for feedback, and often a complete pain to track down. These are the tests that “mostly pass”, but don’t return consistent results. As you build more asynchronous systems, these become more prevalent – from insufficient resources (such as CPU, disk IO or network IO on your test systems) to race conditions that only appear periodically. I recommend you take the feedback from this tier seriously, and collect information that allows you to identify flakey tests over time. Flakes can happen at any tier, and are the worst in the first tier. When I find a flakey test in the first tier, I evaluate whether it should “stop the whole train” and freeze the commits until it’s resolved, or whether I should move it into the second tier and open a bug. It’s up to you and your development process, but think about how you want to handle it and have a plan.
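One simple way to identify flakes from collected results is to look for a test that both passed and failed against the same commit. The sketch below assumes you record each run as a `(test_name, commit, passed)` tuple; that record shape and the `find_flakes` name are my own assumptions for illustration, not the format of any particular CI system.

```python
from collections import defaultdict


def find_flakes(results):
    """Return the sorted names of tests that look flakey.

    `results` is an iterable of (test_name, commit, passed) tuples.
    A test is flagged when at least one commit produced both a pass
    and a failure for it, since the code under test did not change
    between those runs.
    """
    outcomes = defaultdict(set)
    for name, commit, passed in results:
        outcomes[(name, commit)].add(bool(passed))
    return sorted(
        {name for (name, _commit), seen in outcomes.items() if len(seen) == 2}
    )
```

A test that fails every time on a commit is a real failure and is not flagged, and neither is one that passes on one commit and fails on the next. This only catches flakes on commits that were run more than once, which is one more argument for running the second tier continuously.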
The third tier: “deep regression, matrix and performance tests”. You may not always have this tier, but if you have acceptance or release validation that takes an extended amount of time (say, more than a few hours) to complete, then consider shifting it back into another tier. This is also the tier where I tend to handle the time-consuming and complex matrices when they apply. In my experience, if you’re testing across some matrix of configurations (be that software or hardware), the resources are generally constrained and the testing scenarios head towards asymptotic in terms of time. As a rule of thumb, I don’t include “release blockers” in this tier – it’s more about thoroughly describing your code. Benchmarks, matrix validations that wouldn’t block a release, and performance characterization all fall into this tier. If you have this tier, I recommend that you run it prior to every major release and, if resources allow, on a recurring basis so that you can build trends of system characterizations.
There’s a good argument for “duration testing” that sort of fits between the second and third tiers. If you have tests where you want to validate a system operating over an extended period of time, where you validate availability, recovery, and system resource consumption – like looking for memory leaks – then you might want to consider treating a failure in some of these tests as a release blocker. I’ve generally found that I can find memory leaks within a couple of hours, but if you’re working with a system that will be deployed where intervention is prohibitively expensive, then you might want to consider duration tests to validate operation and even chaos-monkey style recovery of services over longer periods of time. Tests for slow memory leaks, pernicious deadlocks, and distributed system recovery failures are all immensely valuable, but take a long “wall clock” time to run.
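As a rough illustration of the memory leak case, a duration test can sample memory use while the system runs and then check the trend rather than any single reading. This sketch fits a least-squares line through `(elapsed_seconds, memory_bytes)` samples; the sample format, the function names, and the threshold are assumptions of mine that you would tune for your own system.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use, in bytes per hour.

    `samples` is a list of (elapsed_seconds, memory_bytes) pairs
    collected while the duration test runs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 3600  # slope is per second; scale to per hour


def looks_like_leak(samples, threshold_bytes_per_hour):
    """Flag the run when memory grows faster than the chosen threshold."""
    return growth_per_hour(samples) > threshold_bytes_per_hour
```

Fitting a trend line tolerates the normal sawtooth of allocation and garbage collection better than comparing the first and last samples, though a slow leak still needs a long enough run to stand out from the noise.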
Reporting on your tiers
As you build out your continuous integration environment, take the time to plan and implement reporting for your tiers as well. The first tier is generally easiest – it’s what most CI systems do with annotations in pull requests. The second and third tiers take more time and resources. You want to watch for flakey tests, collecting and analyzing failures. More complex open source systems look towards their release process to help corral this information – OpenStack uses Zuul (unfortunately rather OpenStack specific, but they’re apparently working on that), and Kubernetes has Gubernator and TestGrid. When I was at Nebula, we invested in a service that collected and collated test results across all our tiers and reported not just the success/failure of the latest run but also a history of success/failure to help us spot those flakes I mentioned above.