Intermittent issues, or bugs with intermittent reproducibility, are some of the most frustrating issues to track down. In my experience, these kinds of bugs are often multi-threaded race conditions or a failed assumption about something happening asynchronously that you thought was synchronous (or just expected to happen more quickly). Tracking them down is immensely time consuming, and most of the existing test frameworks and infrastructure aren’t really well suited to help.
A common pattern that I’ve seen (and adopted) is just repeating the tests again and again to try and flush out the error, (hopefully) with some automation help. I don’t know a name for this process, so I call it “stochastic testing”. Fortunately, it’s generally pretty easy to set something like this up with continuous integration products (e.g. Jenkins), but while that automates the task of repeating the specific tests, it doesn’t really help you deal with the results and data – and that definitely ends up being the challenge.
There’s a continuum. The simplest is just re-running a single test because you’re seeing an error. I expect most people working with a CI system have seen this, even if they haven’t tracked it down or used stochastic testing to help identify the problem. It’s that time when maybe your build cluster is overburdened, and instead of hunting down the bug, someone tacks a sleep(10) into a test script. As much as it pains me to admit it, sometimes just adding that sleep() call is the right answer – often you’re waiting for a database to be live, or a connection to finish establishing, and there just isn’t complete causal traceability to hook into to know for sure when something is ready to be used. A common secondary theme here is people adding a pre-test that validates what should be invariants before the actual test runs, to avoid these errors. It’s more work, and often slower, but it gets the job done – and reliability in tests is a darned worthy goal.
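To make that concrete, here’s a minimal sketch of polling for readiness instead of sleeping blindly; the host, port, and timeout values are illustrative assumptions, not a recommendation from any particular framework:

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP port accepts connections, instead of sleeping blindly."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"{host}:{port} not ready after {timeout} seconds")


# e.g. in test setup: wait for the database to accept connections
# wait_for_port("localhost", 5432)
```

It’s the same “wait until ready” intent as the sleep(10), just bounded and explicit about what it’s waiting for.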
Then there’s the classic multi-threading race condition bug, or maybe a race condition between multiple processes that are expected to be used in coordination, but perhaps don’t always check. I’m seeing the latter more and more with distributed systems being a common way to build services. A quirk of timing may be causing an exception somewhere in a complex interaction or logic chain, and without knowing everything about all the components, the most expedient way to find it is to just re-run the tests – maybe 10, maybe 1000 times – to find and expose the flaw. Here’s where the data about what you’re doing starts to become tricky: your environment is (hopefully) fairly controlled, but you’re likely going to have logs from multiple separate processes, output from the testing process itself, and all of this compounded by the number of times you’re running the test, as well as the need to distinguish the tiny bits that are meaningfully different across these various test runs.
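For a sense of what the brute-force repetition looks like, here’s a small sketch that re-runs a test command many times and keeps the output of only the failing runs. The pytest command and the output layout are assumptions for illustration, not any specific project’s setup:

```python
import subprocess
from pathlib import Path

RUNS = 1000                                      # how many times to repeat the test
CMD = ["pytest", "tests/test_race.py", "-x"]     # hypothetical test command
OUTDIR = Path("stochastic-runs")

OUTDIR.mkdir(exist_ok=True)
failures = 0
for i in range(RUNS):
    result = subprocess.run(CMD, capture_output=True, text=True)
    if result.returncode != 0:
        failures += 1
        # keep full output only for failing runs so the signal isn't buried
        (OUTDIR / f"run-{i:04d}.log").write_text(result.stdout + result.stderr)

print(f"{failures}/{RUNS} runs failed ({100.0 * failures / RUNS:.1f}%)")
```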
This is also about the level at which I’ve found the most benefit in tracking test results over time – with a given build, repeating the same sequences, and reporting the percentage of failures and how they trend over time. Especially when you’re working on a “release” and stabilization, the time to do these tests pays off. You can verify you don’t have memory leaks (or memory retention cycles, as the case may be), or check for reactions in the system that you didn’t anticipate (like it slowing down after running for a while due to a naive algorithm choice). This is also where I like to put in micro-benchmarks, or even full benchmarking code – so I can look at relative performance trade-offs between builds.
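For the tracking-over-time piece, even appending a per-build summary to a flat file and reporting the failure rate per build goes a long way. A rough sketch, assuming a results file with one row per test run (the file layout here is invented for illustration):

```python
import csv
from collections import defaultdict

# each row: build_id, run_index, passed ("1"/"0") -- layout assumed for illustration
rates = defaultdict(lambda: [0, 0])   # build_id -> [failures, total]
with open("stochastic-results.csv", newline="") as fh:
    for build_id, _run, passed in csv.reader(fh):
        rates[build_id][0] += (passed == "0")
        rates[build_id][1] += 1

for build_id, (failed, total) in sorted(rates.items()):
    print(f"{build_id}: {failed}/{total} failed ({100.0 * failed / total:.1f}%)")
```

Watching that percentage move between builds is often more informative than any single red or green run.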
And the continuum extends right out into running a public beta test, or perhaps production. In these cases, maybe there aren’t tests – it’s just straight-up usage, which complicates tracking down the issues even more. In these scenarios, you often only get a traceback, or maybe a crash report with a little more state detailed. By several measures, this is the most common scenario, and the one that has some explicit tooling created to help with the data tracking and analysis. In 2014, the Mozilla test organization posted some details about their mass collection and processing of crash reports. That project, Socorro, has evolved since, and there are commercial products with similar infrastructure for collecting failures and providing a useful view into what would otherwise be overwhelming aggregate data. TestFlight, Sentry, Rollbar, and Crashlytics all come to mind, focused on either web services or mobile device crash reports. Quite a number of vendors also have their own “on failure, phone home” systems with various levels of detail or metrics available to provide support for their products. At the very least, it’s common to find a “generate diagnostic output” option for support in a purchased product.
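The “on failure, phone home” pattern itself is simple to sketch, though the hosted products add batching, sampling, deduplication, and far richer context. The endpoint below is entirely hypothetical:

```python
import json
import platform
import traceback
import urllib.request

CRASH_ENDPOINT = "https://example.invalid/crash-reports"   # hypothetical endpoint


def report_crash(exc: BaseException) -> None:
    """Send a minimal crash report: the traceback plus a little environment state."""
    payload = json.dumps({
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "platform": platform.platform(),
        "python": platform.python_version(),
    }).encode("utf-8")
    req = urllib.request.Request(CRASH_ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass   # never let crash reporting cause its own failure


# usage: wrap the top-level entry point
# try:
#     main()
# except Exception as exc:
#     report_crash(exc)
#     raise
```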
What I haven’t seen outside of some custom work is the tooling to collect the output of repeated testing and the logs of the relevant processes, and to proactively analyze them into a state/status dashboard for that testing. When I was at Nebula, the engineering infrastructure team built up a large regression testing infrastructure. Running some of the acceptance tests in a stochastic mode – repeatedly, and reporting on the error percentages – was a common practice and a good measure of how we were progressing. I do rather wish that were more common (and perhaps easier) across a broader variety of technologies.
As a side note, one of the really useful tricks I found for highlighting race condition bugs was running tests with constrained resources. It started as an “accident” – I made the cheapest/smallest VM I could so I could have a lot of them, and in using those for the slaves in a CI system, I found far more issues than we ever did with developer machines, which tended to have a fair amount of resources. Constraining the CPU, and sometimes the memory, really highlighted the issues and could often change the sequencing of operations enough to reproduce some of those hard-to-find race condition bugs.
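A cheap way to approximate that starvation without maintaining a fleet of tiny VMs is to cap the test process itself. This sketch uses POSIX rlimits; the specific limits are arbitrary, and a container CPU quota would be a closer match to the small-VM effect:

```python
import resource
import subprocess

CMD = ["pytest", "tests/test_race.py"]   # hypothetical test command


def constrain():
    # runs in the child before exec: cap address space to 256 MiB
    # and total CPU time to 60 seconds to tighten scheduling pressure
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024, 256 * 1024 * 1024))
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))


result = subprocess.run(CMD, preexec_fn=constrain)
print("exit status:", result.returncode)
```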
The fundamental problem is getting information out of your tests and testing system to determine what’s gone wrong, so you know how to fix the problem. In practice, that often meant combing through verbose logging and hand-correlating events. Things like distributed system tracing, profiling of the relevant systems, and monitoring of resource utilization could have helped significantly.
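A lot of that hand-correlation boils down to merging several per-process logs into a single timeline. A tiny sketch, assuming each log line starts with an ISO-8601 timestamp (an assumption about the log format, not a given):

```python
import heapq
import sys
from pathlib import Path


def timestamped_lines(path):
    """Yield (timestamp_prefix, source, line) for each line of one log file."""
    with open(path, errors="replace") as fh:
        for line in fh:
            stamp = line.split(" ", 1)[0]   # ISO-8601 prefixes sort lexically
            yield stamp, path.name, line.rstrip("\n")


# merge the per-process logs (passed as arguments) into one interleaved timeline
streams = [timestamped_lines(Path(p)) for p in sys.argv[1:]]
for stamp, source, line in heapq.merge(*streams):
    print(f"{source}: {line}")
```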
Going through this process also highlights a trove of other errors that obscure the real issues. In my own experience, I confused myself (and others) by bundling too many things into a single test, or by using a testing framework (often a unit-testing framework) that didn’t include the hooks or tooling to help capture the information needed. And finally, of course, there’s always the issue of hardware failing – drives or RAM going bad in one machine, leading to both false positives and false negatives in testing. In distributed systems, a temporary network blip has the same effect, often causing a ripple of otherwise unreproducible errors while testing.
Having a consistent means of collecting, analyzing, and exposing this kind of information would be a huge boon to both development and debugging. There are many variations in this space: it’s easy to find any one piece, but quite challenging to assemble a holistic solution. I’ve wondered if some hosted (or open source) data analysis solution would work: a mix of collecting metrics, events, logs, and potentially even tracing data, combined with some knowledge of the evolution of builds/source-control changes and the testing that’s happening, to help track down issues.
I keep circling back to this, noodling and considering how to tackle the problem. It’s sort of “missing developer tooling” for me. What would make tracking down these issues easier? What would be good metrics and questions to answer that can provide a sense of progress over time? How do you reasonably represent the current “state” of quality in the face of a constantly changing system? I have lots of questions and no clearly obvious answers – just musings and theories right now. If you have something you’ve used that helped identify these kinds of problems, I’d love to know about it.