Does the game of chess show us a map to future jobs?

I have been interested in the field of Artificial Intelligence since the 80's – pretty much the heart of the "AI winter". Like many others, I remember when Kasparov lost to IBM's Deep Blue chess AI in 1997, and watched in fascination as DeepMind's AlphaGo beat Lee Sedol at Go in 2016.

At this point, the Kasparov defeat is mostly relegated to the history books, as everyone is focused on the AlphaGo wins and the follow-on uses of the technology behind them. Kasparov's experience since that defeat is largely ignored, but it shouldn't be. What has happened to the game of chess since then, with human and AI opponents, is fascinating – and I think informative.

In the wake of chess AI getting so strong, a style of chess developed called Freestyle chess, and from that a "Centaur" model appeared: a team of a human and an AI system working together to play the game – and it's immensely powerful. There's an article from 2010 that outlines more of Kasparov's views, and it's worth a read if you're interested. I'm also certainly not the first to see this. There are articles in the Huffington Post about Centaur Chess, from the BBC in 2015, and from TechCrunch last year that draw the same parallels.

The gist of the thesis is that current AI and human cognition aren't strong in the same areas, and each tackles tasks in very different ways. If you combine the two in a manner that exploits the strengths of each, you get massive capability and productivity improvements. It is exactly this pattern that I see as the future for quite a number of jobs.

I don't know exactly what form this will take, but I can extrapolate some obvious pieces – two attributes that will show up more and more.

The first is "explainability". Somewhere tucked in between psychology, cognitive science, and the art of user experience and design is the place where you get amazing transfer of information between computing systems and humans. If you have an AI system predicting or analyzing something, then sharing that information effectively is critical to effective coordination. Early examples are already prevalent in cars, planes, and military combat vehicles, which have had even more of this kind of investment. Heads-up displays (HUDs) are so relevant that they're a common staple of video games today. And with the variety of first-person shooters, we're even training people en masse to understand the concepts of assistive displays.

But don't get too focused on the visual aspects of explainability – the directional haptics and sounds the Apple Watch uses while navigating are another example: very different from a heads-up display, but just as effective. Conversational interfaces like you're seeing with Alexa, Siri, or Google Home are all steps toward broadening interactions, and the whole field of affective computing reaches into the world of understanding, and conveying, information at an emotional level.

Because it's so different from how AI has been portrayed in movies and in public opinion, some folks are trying to highlight that this is something completely different by calling it Intelligence Augmentation, although that seems to be a corporate marketing push that hasn't gained much traction.

The second is collective intelligence. This was more of a buzzword in the late 90's, when some of the early successes really started appearing. The obvious, now common, forms of this are recommendation engines, but that's really only the edge of where the strengths lie. By collecting and collating multiple opinions, we can leverage collective knowledge to help AI systems know where to start looking for solutions. This includes human bias – and depending on the situation, that can be helpful or hurtful to the solution. In a recommendation engine, for example, it's extremely beneficial.

I think there are secondary forms of this concept, although I haven't seen research on them, that could be just as effective: using crowd-sourced knowledge to know when to stop looking down a path, or even a category, of searches. I suspect a lot of that kind of use is happening right now in guiding the measures of effectiveness of current AI systems – categorization, identification, and so forth.

There is a lot of potential in these kinds of systems, and the biggest problem is that we simply don't know how best to build them or where to apply them. There's a lot of work still outstanding just to identify where AI assistance could improve our abilities, and even more work in making that assistance easy to use and applicable to a wide diversity of people and jobs. It's also going to require training, learning, and practice to use any of these tools effectively. This isn't downloading "kung fu" into someone's head, but more like giving Archimedes the lever he wanted to move the world.

Aside

Navigating European small towns with an Apple Watch

A little before I started my recent walk-about in Europe, I invested in a Series 2 Apple Watch. I had not worn a watch in over a decade – I used to carry a pocket watch until smart phones made that extraneous. I thought I'd try it as a replacement for my twice-worn-out Fitbit, since it had compelling features as a fitness tracker: the obvious step counter, plus heart-rate monitoring and nice integration to encourage activity and exercise. I started "closing the rings", as they say, and along with more frequent access (and visits) to the gym, it's been tremendously successful for me. Most important to my earliest concerns, I haven't destroyed it by being a klutz and smashing my wrist into walls and door frames as I was wont to do a decade or two ago.

The surprise win was how amazingly effective it is for helping to navigate with directions from Apple's Maps iOS app. When we were traveling about some of the smaller towns, navigating from the train station to our B&B or hotel was a common need. The hardest days were when we switched countries and were dealing with a different language, different customs for sign placement, and generally trying to learn what to listen for as the languages changed. I kept my cell phone active with a data roaming plan just for this – I knew I'd need help to find my way.

The surprise was how subtle and effective the combination of haptic feedback and slight sounds emitted from the watch was for navigation. One of the things I found myself constantly failing at was translating scale from maps (digital or paper) to the physical reality of streets in the smaller European towns. We stayed in and wandered through amazing locations, and not only are the streets much smaller, the organic street layouts really led to our confusion. What my US-trained senses thought was another 100 yards away was maybe only 20, and early navigation involved frequent back-tracking or circling around as we tried to get used to the differences.

I found that when I was walking and navigating over some distance (this really shone while walking in Utrecht), the "make a left" or "make a right" haptic signals that came through the watch were brilliantly timed with my walking pace. The tone of the two "blips" goes up for an upcoming right, down for a left. I still did a periodic double-take, or hunted for the name of the street to make sure I wasn't veering down some strange side path, but it was immensely effective.

The other complete win is that getting direction feedback from the watch is subtle.

I don't mind standing out periodically like a tourist newb, but not all the time – I didn't want to constantly paint a tourist-target sign on myself if I could help it. As I got comfortable with the watch providing muted sound and haptic feedback, I stopped doing the "spin in a circle with my iPhone" routine to figure out where I was and where I needed to go. The pattern I ended up developing was selecting my target destination on my phone, getting the directions active, and then dropping it in my pocket. From there, the watch displayed the "next step" in the directions nicely, and had provisions for scanning backward (or looking ahead) – and while I was walking it was easy to glance at my wrist, as though I were looking at the time, and see a short note like "turn left in 300 feet". The quality of the maps was outstanding, and possibly the best thing was that, with my phone safely tucked away, I spent far more time looking around, getting familiar with each location, and generally being more aware of my surroundings than I would have been tracking with a phone.

The only place where Apple's Maps iOS app and directions didn't shine was when I needed to use a combination of public transit and walking in some of these smaller towns. More than once, Google Maps gave better transit directions – or any transit directions at all in some cases – where Apple's Maps application just didn't have the detail. Larger cities like Copenhagen, Berlin, and Paris weren't any problem, but some of the buses or trams in the Netherlands or Belgium just didn't seem to be covered consistently.

Stochastic testing and developer tools

Intermittent issues, or bugs with intermittent reproducibility, are some of the most frustrating issues to track down. In my experience, these kinds of bugs are often multi-thread race conditions or a failed assumption about something happening asynchronously that you thought was synchronous (or just expected to happen quicker). Tracking these down is immensely time consuming, and most of the existing test frameworks and infrastructure aren’t really well suited to help.

A common pattern that I’ve seen (and adopted) is just repeating the tests again and again to try and flush out the error, (hopefully) with some automation help. I don’t know a name for this process, so I call it “stochastic testing”. Fortunately, it’s generally pretty easy to set something like this up with continuous integration products (i.e. Jenkins), but while that automates the task of repeating the specific tests, it doesn’t really help you deal with the results and data – and that definitely ends up being the challenge.

There’s a continuum. The simplest is just re-running a single test because you’re seeing an error. I expect most people working with a CI system have seen this, even if they haven’t tracked it down or used stochastic testing to help identify the problem. It’s that time when maybe your build cluster is overburdened, and instead of hunting down the bug, someone tacks a sleep(10) into a test script. As much as it pains me to admit it, sometimes just adding that sleep() call is the right answer – often you’re waiting for a database to be live, or a connection to finish establishing, and there just isn’t complete causal traceability to hook into to know for sure when something is ready to be used. A common secondary theme here is seeing people add a pre-test that validates what should be invariants before the actual test runs, to avoid errors. It’s more work, often slower, but it gets the job done – and reliability in tests is a darned worthy goal.
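For example, rather than a blind sleep(10), that pre-test check might poll for the invariant the test actually depends on – a rough sketch, where the host, port, and test script are placeholders:

# wait (up to 30 seconds) for the database to accept connections before starting the tests
for i in $(seq 1 30); do
  nc -z localhost 5432 && break   # placeholder: substitute whatever readiness check your test needs
  sleep 1
done
./run_tests.sh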

Then there’s the classic multi-threading race condition bug, or maybe a race condition between multiple processes that are expected to be used in coordination, but perhaps don’t always check. I’m seeing the latter more and more with distributed systems being a common way to build services. A quirk of timing may be causing an exception somewhere in a complex interaction or logic chain, and without knowing everything about all the components, the most expedient way to find it is to just re-run the tests – maybe 10, maybe 1000 times – to find and expose the flaw. Here’s where the data about what you’re doing starts to become tricky: your environment is (hopefully) fairly controlled, but you’re likely going to have logs from multiple separate processes and output from the testing process itself, all compounded by the number of times you’re running the test and the need to distinguish the tiny bits that are meaningfully different across those runs.

This is also about the level where I’ve found the most benefit in tracking the test results over time – with a given build, repeating the same sequences, and reporting the percentage of failures and the failures over time. Especially when you’re working on a “release” and stabilization, the time to do these tests pays off. You can verify you don’t have memory leaks (or memory retention cycles, as the case may be), or check for reactions in the system that you didn’t anticipate (like it slowing down after running for a while due to a naive algorithm choice). This is also where I like to put in micro-benchmarks, or even full benchmarking code – so I can look at relative performance trade-offs between builds.
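A minimal sketch of what that looks like in practice – the test command and log layout here are placeholders, and in a real setup this loop usually lives in a CI job rather than a local script:

#!/usr/bin/env bash
# run the test suite repeatedly, keep each run's log, and report the overall failure rate
RUNS=${1:-100}
mkdir -p stochastic-logs
failures=0
for i in $(seq 1 "$RUNS"); do
  if ! ./run_tests.sh > "stochastic-logs/run-$i.log" 2>&1; then
    failures=$((failures + 1))
  fi
done
echo "$failures failures out of $RUNS runs ($((100 * failures / RUNS))% failure rate)"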

And the continuum extends right out into running in a public beta test, or perhaps production. In these cases, maybe there aren’t tests – it’s just straight-up usage, which complicates tracking down the issues even more. In these scenarios, you often only get a traceback, or maybe a crash report with a little more state detail. By several measures, this is the most common scenario, and the one that has explicit tooling created to help with the data tracking and analysis. In 2014, the Mozilla test organization posted some details about their mass collection and processing of crash reports. That project, called Socorro, has evolved since, and there are commercial products with similar infrastructure for collecting failures and providing a useful view into what would otherwise be overwhelming aggregate data. TestFlight, Sentry, Rollbar, and Crashlytics all come to mind, focused on either web services or mobile device crash reports. Quite a number of vendors also have their own “on failure, phone home” systems with various levels of detail or metrics available to provide support for their products. At the very least, it’s common to find a “generate diagnostic output” option for support in a purchased product.

What I haven’t seen outside of some custom work is tooling to collect the output of repeated testing and the logs of processes, and proactively do analysis to provide a state/status dashboard for that testing. When I was at Nebula, the engineering infrastructure team built up a large regression testing infrastructure. Running some of the acceptance tests in a stochastic mode – repeatedly, and reporting on the error percentages – was a common practice and a good measure of how we were progressing. I do rather wish that were more common (and perhaps easier) across a broader variety of technologies.

As a side note, one of the really useful tricks I found for highlighting race condition bugs was running tests with constrained resources. It started as an “accident” – I made the cheapest/smallest VM I could so I could have a lot of them, and in using those as the slaves in a CI system, I found far more issues than we ever did with developer machines, which tended to have a fair amount of resources. Constraining the CPU, and sometimes the memory, really highlighted the issues and could often change the sequencing of operations enough to reproduce some of those hard-to-find race condition bugs.
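Container runtimes make it easy to recreate that “accident” deliberately. A hedged sketch – the image name, resource limits, and test script are illustrative:

# run the tests in a deliberately resource-starved container to shake out timing issues
docker run --rm --cpus 0.5 -m 512m -v $(pwd):/data:rw -w /data my-test-image \
    /bin/bash -c "./run_tests.sh"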

The fundamental problem is getting information out of your tests and testing system to determine what’s gone wrong, so you know how to fix the problem. For me, that has often meant combing through verbose logging and hand-correlating events. Things like distributed tracing, profiling of the relevant systems, and monitoring of resource utilization could have helped significantly.

Running these processes also highlights a trove of other errors that obscure the real issues. In my own experience, I confused myself (and others) by bundling too many things into a single test, or by using a testing framework (often a unit-testing framework) that did not include hooks or tooling to help capture the information needed. And finally, of course, there’s always the issue of hardware failing – drives or RAM dying in one machine, leading to both false positives and false negatives in testing. In distributed systems, a temporary network blip has the same effect – often causing a ripple of otherwise unreproducible errors while testing.

Having a consistent means of collecting, analyzing, and exposing this kind of information would be a huge boon to both development and debugging. There are many variations in this space; it’s easy to find any one piece, and quite challenging to assemble a holistic solution. I’ve wondered if some hosted (or open source) data analysis solution would work: a mix of collecting metrics, events, logs, and potentially even tracing data, combined with some knowledge of the evolution of builds, source control changes, and the testing that’s happening, to help track down issues.

I keep circling back to this, noodling and considering how to tackle the problem. It’s sort of “missing developer tooling” for me. What would make tracking down these issues easier? What would be good metrics and questions to answer that could provide a sense of progress over time? How do you reasonably represent the current “state” of quality in the face of a constantly changing system? I have lots of questions, and no clearly obvious answers – just musings and theories right now. If you have something you’ve used that helped identify these kinds of problems, I’d love to know about it.

WordPress for iOS

I use WordPress as a blog engine. I started using it over 12 years ago – at first I hosted it myself, and two years ago I shifted to wordpress.com's hosted service, with which I've been very happy. I do all the editing through a browser – until now.

I tried out the WordPress iOS app some time ago, but the editing mismatches and the inability to provide a writing experience consistent with the browser-based interface left me immensely frustrated. I deleted it and went back to just using Safari or Chrome – not the greatest experience on a mobile device, but better than nothing. While I was walking about over the past couple of months, I heard they had a new editor, so I decided to give it a shot.

Long story short, the new editor (code-named Aztec) is a tremendous boost over the existing system, and even in beta a far better experience for me. If you're publishing to a WordPress blog, I'd encourage you to give it a shot. It's become far easier to include photos, embed markup, and manage the layout of the writing than it's been in ages.

It's not perfect, but I wouldn't expect it to be in beta form. The means of setting up a scheduled post is still awkward as hell, and selecting categories and other ancillary metadata for a post is fairly constrained. But the editing experience is such an improvement that I'd recommend using it immediately.

Ship tests with the code

It has been conventional wisdom for years that you don’t want to ship your tests with your code. When I was doing quite a lot of build engineering, there was always a step in creating the packaging that separated the test code from the production code. Maybe that was excluding some test classes when packing a jar, or stripping a subdirectory out of a Python, Ruby, or JavaScript package so that the packaged library didn’t get “bloated” with the tests.

I think we should reconsider that stance…

I’m not sure what’s changed my mind about this, or that I have a new concept fully formed. Some of this extends the concept of healthz that Kelsey Hightower talked about at Monitorama in 2016. Some of it is continued thinking on fully autonomic systems that I’ve written about in this blog over the past year.

My own experiences with building and integrating tests into CI and CD pipelines in a number of languages reinforce that viewpoint. I see clear benefits for distributed development teams using TDD or tests as service contract validation. I first used this sort of concept some 17 years ago at the startup Singingfish, to validate service contracts for a pipeline that built data for a streaming-media search engine, and I’ve re-used “tests as validation” in nearly every position since then.

There is a huge overlap in value between the tests written during development and what you need when debugging and isolating a problem while your code is running in production. Ironically, it is usually when tests break that you realize your own testing setup may not be making it easy to debug and isolate the relevant problem – which highlights a good time to refactor the tests! (For the “test pedants” out there: I’m speaking about functional or integration tests, as distinct from unit tests.) Unit tests are often too granular to be beneficial for the idea I’m thinking about, and the pieces that change as your code runs – service status, persistent backends, and so on – are often better covered by slightly higher-level tests.

The challenge is that there isn’t a consistent way to package and deploy tests that work against a system, and there are complications in that sometimes what you want to test and validate will change the state of the system. Or perhaps better stated: every language, and nearly every deployment methodology, has its own method – a very diverse set of conventions with unfortunately little overlap.

Container deployments are shifting the granularity of deployment to a level where tests relevant to a single container, used as a diagnostics and validation system, seem not only feasible but are getting to be a damned good idea. The healthz concept, health checks, and liveness probes that are included with container orchestration systems are some of these tests. If you’re deploying with Kubernetes, Nomad, or some of these newer tooling sets, you may have already built some of these API endpoints into your code.

I can go all Star Trek and start suggesting “diagnostic levels”, assigning potential impact and requirements for running the tests and what doing so means for the systems involved. If you have something like Hystrix, or some other circuit-breaker concept, in your distributed system, or if your software is generally antifragile, you might be able to run these diagnostics fairly frequently with minimal notice or impact. Chaos Monkey (and the related tools) that Netflix popularized to abuse distributed systems is a variant of this kind of validation.

Perhaps there could be a consistent means by which we expose the ability to invoke these diagnostics: a controlled interface to run diagnostics as a core and functional part of the services we’re building – a diagnostics REST API, or a gRPC endpoint. The simple, read-only version could extend something like the Prometheus metrics format so that multiple higher-order systems can read and interpret the information. More complex APIs could use POST or some RPC mechanism to invoke more intrusive levels of diagnostics when needed.
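To make that concrete, here is a purely hypothetical sketch of what calling such an API might look like – the endpoint paths, port, metric names, and output are invented for illustration, not an existing convention:

# read-only diagnostics, exposed as Prometheus-style text (hypothetical output shown in comments)
curl http://localhost:8080/diagnostics
# diag_database_reachable 1
# diag_worker_queue_depth 42

# explicitly invoke a deeper, more intrusive diagnostic run
curl -X POST "http://localhost:8080/diagnostics/run?level=2"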

I don’t know what form it could or should take, but something that could be invoked by Docker, Marathon, Kubernetes, CloudFoundry, Heroku, or even the upcoming “serverless” frameworks like OpenWhisk or Fission would be the goal – something to provide information back to the humans building and responsible for running the service, so they understand what’s functional and what’s not. We’re not even at healthz yet, but we could be, there and beyond it: writing and shipping software that has the capability to validate its own functionality.

 

Walk about

It’s been all quiet here since I ran out of queued blog posts back in April. I’ve been “on the road”, sometimes in a very real sense, and mostly disconnected for the past several months. I’m back home now, cleaning up and recovering from the extended trip. When I started on this trip, I thought it was perhaps about resetting my interests and meeting some new people with different backgrounds and from different cultures. What I found is that my interests haven’t varied all that much from when I started, though we did meet some amazing people, make a few new friends, and see some amazing sights. If anything, it was art and sketching that came to the forefront.

I am “interested from a distance” (meaning I don’t want to create it myself, but I very much appreciate the work) in architecture and buildings – and historical, especially medieval, buildings really capture my interest. I suppose it’s not surprising that my sketching followed that interest, and I took some time to work specifically on pencil sketching as well as pen and ink, starting to build up some means of capturing and hinting at textures. I still completely suck at human figures, but I think I made some real progress on trees and organic landscapes.

A lot of our recent travel was around smaller, preserved towns in Germany and France – and cobblestones turned into my own personal nemesis to represent, so I selected some of those sketches to post here. I pushed some panoramas to Facebook (they display really well there) and more general photos onto Twitter, so while I don’t usually mention those sites from here, if you’re interested in some of those photos, check out https://www.facebook.com/ccjoe or http://twitter.com/heckj.

I’m mostly recovered from jet-lag now, and sorting through the ideas and journaling I did in between sketching, sight-seeing, and abusing the Dutch, French, German, and Icelandic languages. (Icelandic is really cool; this was my first real introduction to it.) I’ll have more posts in the coming weeks as I sort out thoughts and ideas, and re-read my scribblings in the sketchbook.

Wading back into the Cocoa Frameworks

I presented last night at Seattle Xcoders, the first time in a number of years – the last time was when I was still working for Disney. The presentation was a rehash of the talk I did at the Swift Cloud Workshop a couple of months ago, sharing a bit of what I’ve learned from my bug fixing and “getting into Swift” by working on the Swift Package Manager.

I have a few more fixes and bugs I want to tackle, but for the past two weeks I focused on getting back into the Cocoa frameworks, relearning and catching up. I last used these libraries about 7 years ago. My oh my, the world there has seriously changed! There are some nice things and some not-so-nice things – great technical advances and a couple of head-slappers.

The first thing that’s obvious is that the API surface area has grown enormously. All the pieces I worked with previously are there, and nearly double that again in new libraries and frameworks. The frameworks are akin to tree rings, too – you can see the generational aspects of when the APIs were first developed in the method signatures: some using NSError callbacks, some using blocks, and less consistency than you’d expect from a decade of evolution. Or perhaps exactly what you’d expect from 15+ years of organic growth and language evolution.

Alas, the documentation hasn’t kept up with the API surface area – even from external publishers. Where I used to be able to learn 90% of the key API structures from something like the Big Nerd Ranch iOS Programming guide, today it only covers an introduction – and misses entirely things like storyboards and anything in depth with Auto Layout. Apple’s documentation has even less structural organization around it: still embedded and (mostly) technically correct, but extremely limited. Without the sample code and WWDC videos they publish to tease it apart, I would be lost – most of the documentation is written with the expectation that you already have the knowledge you’re seeking on how to use the frameworks.

I also find it immensely frustrating that you can’t download the various guides they publish to something like iBooks and read them offline or on the side with an iPad. There is contextual help in Xcode, but the screen real-estate is so cramped on my laptop that I’m often keeping an iPad on the side with some documentation (or that Big Nerd Ranch book propped open) while I’m digging and learning.

When Apple enabled their forums, I thought that might help with some of the gaps, but in reality it’s Stack Overflow that’s made the difference. The Q&A site often includes sample code in detailed questions and answers from a community helping itself, and it isn’t uncommon for the answers there to include references back to the formal documentation. That linkage provides the “explanation” for the otherwise arcane but technically accurate wording in the Apple-provided documentation.

And finally there’s Xcode – I still see fragments of the separate Interface Builder in references, but it’s all a single app now. The debugging and diagnostic tools provided by Apple are absolutely amazing, especially some of the recent advances in visually displaying view and memory debugging, and the whole Instruments toolchain, which has been my gold standard for understanding how a process is operating ever since I started working on other platforms 5-7 years ago. I keep trying to recreate those capabilities using whatever native tools I can find on the other platforms just to see what’s happening.

I am not as much a fan of the design tooling. The storyboard views and complex inspector setup, as well as the “fiddly magic” of knowing what to click, where to click, and how to link things together, aren’t well coordinated or even consistent. I think it represents the steepest learning curve in developing iOS apps, which is frustrating, because you can tell it’s trying to make things easier.

Also frustratingly, it’s not internally consistent from a composition point of view – so there are times when you just need to wipe out a whole view controller and recreate it rather than starting with something small and building up to what you want.

When iOS first came out, I built and taught some workshops with O’Reilly on getting started with it. With the complexity of the APIs, the development setup, and the technologies you need to learn to get started today, I think it would take weeks for a person completely new to the Apple ecosystem. I’m not even sure a week-long bootcamp-style setup would do more than introduce the barest minimum needed to get up and running.

I’m not at all surprised at the focus on technologies like React Native for leveraging web development knowledge into mobile development, although that’s got to make debugging a comparative nightmare. Not my interest, to be honest, because I wanted to see what was possible (and new) with the native capabilities. But I can see the appeal in leveraging knowledge you already have elsewhere.

spelunking swift package manager – Workspaces

I finished up my last bug fix and tackled another. SR-4261 took me deep into some areas of the code base that I wasn’t familiar with, so I did more spelunking to get a decent clue of what the code was doing before making the fix. The results of that spelunking are this post.

Many of the recent changes to SwiftPM have been about managing the workspace of a project – pinning dependencies, updating and editing dependencies, and handling that whole space of interactions. These are exposed as commands under swift package, such as swift package edit and swift package pin.
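For reference, invoking those from a package directory looks something like this – the dependency name is illustrative, and the exact flags have shifted between SwiftPM releases:

swift package edit SomeDependency   # put the dependency into the locally-editable state
swift package pin SomeDependency    # record a pin for that dependency's current version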

In the codebase, most of the high level logic for these commands resides in the target Workspace.

If you want a map of all the targets in the project, I included a graph of targets in my first round of spelunking swift package manager.

There are a number of interesting classes in Workspace, but the ones I focused on were related to manifests and managed dependencies: Manifest and ManagedDependency. These two are combined in Workspace in DependencyManifest, which represents SwiftPM’s knowledge of the workspace and its current state, and provides the means to manipulate the workspace with commands like edit and pin.

When swift package manager wants to manipulate the workspace, the logic starts out by loading the current state. This pretty consistently begins with loading up the main targets by invoking loadRootManifests, which in turn uses the manifest loading logic and an interesting (newish to SwiftPM) piece: a DiagnosticsEngine. The DiagnosticsEngine collects errors and can emit interesting details for tooling that wants to provide more UI or feedback information.

loadRootManifests loads these from a list of AbsolutePath values passed in from the swift CLI – generally the current working directory of your project. In any case, loadRootManifests returns an array of Manifest, which is the key to loading up information about the rest of the workspace.

The next step is often loadDependencyManifests, which returns an instance of DependencyManifests. This does the work of loading the dependencies needed to create the holistic view of the project to date and loading the relevant state. Loading the ManagedDependencies leverages the class LoadableResult, a generic class for loading persisted state from a JSON file – Pins and ManagedDependencies are both loaded using it. In the case of ManagedDependencies, it loads from the file dependencies-state.json, which includes the current state, current path, and relevant details about each repository. That file also indicates any packages in the “edit” state, and loading it validates (using validateEditedPackages) that the relevant repositories exist at the correct locations on disk.

Pins work a bit differently, being stored in the source path. This allows them to be included in the project source in order to concretely specify versions, or constraints on versions, for each dependency. The ManagedDependencies are maintained in the working directory for builds, are not expected to be in source control, and represent the state of things on your local machine.

The way that SwiftPM handles dependency resolution is by using a collection of RepositoryPackageConstraint and a constraint solver to resolveDependencies. The DependencyResolver is its own separate thing under the PackageGraph package, and I have not yet dug into it. Most notably, the DependencyResolver will throw an error if it’s unable to resolve the constraints provided to it – and that’s the heart of the bug SR-4261, which is about adding in the missing constraints for edited packages when invoking the pin command.

Using 3 Tiers of Continuous Integration

A bit over a year ago, I wrote out Six rules for setting up continuous integration, which received a fair bit of attention. One of the items I called out was the speed of tests, suggesting to keep it to around 15-20 minutes to encourage and promote developer velocity.

A few weeks ago, the team at SemaphoreCI used the word “proper” in a similar blog post, and suggested that a timing marker should be 10 minutes. This maps pretty directly to a new feature they’re promoting to review the speed of their CI system as it applies to your code.

I disagree a bit with the 10 minute assertion, only in that it is too simplistic. Rather than a straight “10 minute” rule, let me suggest a more comprehensive solution:

Tier your tests

Not all tests are equal, and you don’t want to treat them as such. I have found 3 tiers to be extremely effective, assuming you’re working on a system that can have some deep complexity to review.

The first tier: the “smoke test”. Lots of projects have this concept, and it is a good one to optimize for. This is where the 10-minute rule that SemaphoreCI is promoting is a fine rule of thumb. You want to verify as much of the most common paths of functionality as you can, within a time-boxed boundary – you pick the time. This is the same tier that I generally encourage for pull request feedback testing, and I try to include all unit testing in this tier, as well as some functional testing and even integration tests. If these tests don’t pass, I generally consider a pull request unacceptable for merge – which makes this tier a quality gate.

I also recommend you have a consistent and clearly documented means of letting a developer run all of these tests themselves, without having to resort to your CI system. The CI system provides a needed “source of truth”, but it behooves you to expose all the detail of what this tier does, and how it does it, so that you don’t run into blocks where a developer can’t reproduce the issue locally to debug and resolve it.

The second tier: “regression tests”. Once you acknowledge that timely feedback is critical and pick a marker, you may start to find some testing scenarios (especially in larger systems) that take longer to validate than the time you’ve allowed. Some you’ll include in the first tier, where you can fit things to the time box you’ve set – but the rest should live somewhere and get run at some point. These are often the corner cases, the regression tests, integration tests, upgrade tests, and so forth. In my experience, running these consistently (or even continuously) is valuable – so this is often the “nightly build & test” sequence. This is the tier that starts “after the fact” feedback, and as you’re building a CI system, you should consider how you want to handle it when something doesn’t pass these tests.

If you’re doing continuous deployment to a web service then I recommend you have this entire set “pass” prior to rolling out software from a staging environment to production. You can batch up commits that have been validated from the first tier, pull them all together, and then only promote to your live service once these have passed.

If you’re developing a service or library that someone else will install and use, then I recommend running these tests continuously on your master or release branch, and if any fail then consider what your process needs to accommodate: Do you want to freeze any additional commits until the tests are fixed? Do you want to revert the offending commits?  Or do you open a bug that you consider a “release blocker” that has to be resolved before your next release?

An aside here on “flakes”. The reason I recommend running the second tier of tests on a continuous basis is to keep a close eye on an unfortunate reality of complex systems and testing: flakey tests. Flakes are invaluable feedback, and often a complete pain to track down. These are the tests that “mostly pass”, but don’t always return consistently. As you build more asynchronous systems, these become more prevalent – from insufficient resources (such as CPU, disk IO, or network IO on your test systems) to race conditions that only appear periodically. I recommend you take the feedback from this tier seriously, and collect information that allows you to identify flakey tests over time. Flakes can happen at any tier, and are the worst in the first tier. When I find a flakey test in the first tier, I evaluate whether it should “stop the whole train” – freezing commits until it’s resolved – or whether I should move it into the second tier and open a bug. It’s up to you and your development process, but think about how you want to handle it and have a plan.

The third tier: “deep regression, matrix, and performance tests”. You may not always have this tier, but if you have acceptance or release validation that takes an extended amount of time (say, more than a few hours) to complete, then consider shifting it back into another tier. This is also the tier where I tend to handle the time-consuming and complex matrices when they apply. In my experience, if you’re testing across some matrix of configurations (be that software or hardware), the resources are generally constrained and the testing time heads towards asymptotic. As a rule of thumb, I don’t include “release blockers” in this tier – it’s more about thoroughly describing your code. Benchmarks, matrix validations that wouldn’t block a release, and performance characterization all fall into this tier. If you have this tier, I recommend running it prior to every major release and, if resources allow, on a recurring periodic basis so you can build up trends of system characterization.

There’s a good argument for “duration testing” that sort of fits between the second and third tiers. If you have tests where you want to validate a system operating over an extended period of time – checking availability, recovery, and system resource consumption, like looking for memory leaks – then you might want to consider failing some of these tests as a release blocker. I’ve generally found that I can find memory leaks within a couple of hours, but if you’re working with a system that will be deployed where intervention is prohibitively expensive, then you might want duration tests that validate operation, and even chaos-monkey-style recovery of services, over longer periods of time. Slow memory leaks, pernicious deadlocks, and distributed system recovery failures are all types of tests that are immensely valuable, but take a long “wall clock” time to run.
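As a rough sketch of how the tiers might get wired into CI triggers – the targets and schedule here are illustrative, not a prescription:

# Tier 1: run on every pull request, time-boxed – the quality gate for merging
make smoke-test          # unit tests plus fast functional and integration coverage

# Tier 2: run nightly (or continuously) against master
make regression-test     # corner cases, integration, and upgrade tests

# Tier 3: run before each major release, or periodically if resources allow
make benchmark-test matrix-test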

Reporting on your tiers

As you build out your continuous integration environment, take the time to plan and implement reporting for your tiers as well. The first tier is generally easiest – it’s what most CI systems do with annotations in pull requests. The second and third tiers take more time and resources. You want to watch for flakey tests, collecting and analyzing failures. More complex open source systems look to their release process to help corral this information – OpenStack uses Zuul (unfortunately rather OpenStack-specific, but they’re apparently working on that), and Kubernetes has Gubernator and TestGrid. When I was at Nebula, we invested in a service that collected and collated test results across all our tiers and reported not just the success/failure of the latest run, but also a history of success and failure to help us spot those flakes I mentioned above.

 

Using docker to build and test SwiftPM

I sat down to claw my way through blog posts, examples, and docker’s documentation to come up with a way to build and test Swift Package Manager on Linux as well as the Mac.

There are a number of ways to accomplish this; I will just present one. You are welcome to use it or not and any variations on the theme should not be difficult to sort out.

First, you’ll need a docker image with the Linux swift toolchain installed. IBM has a swift docker image you can use, and recently another was announced and made available called ‘swiftdocker‘, which is a Swift 3.0.2 release for Ubuntu 16.04. I riffed on IBM’s code and my previous notes for creating a swift development environment to build the latest master-branch toolchain into an Ubuntu 16.04 based image. If you want to follow along, you can snag the Dockerfile and related scripts from my vagrant-ubuntu-swift-dev GitHub repo and build your own locally. The image is 1.39GB in size and named swiftpm-docker-1604.

SwiftPM is a bit of a special snowflake compared to other server-side swift software – in particular, it’s part of the toolchain itself, so working on it involves some bootstrapping so that you can get to the moral equivalent of swift build and swift test. Because of that setup, you leverage a script they created to run the bootstrapping: Utilities/bootstrap.

SwiftPM has also “moved ahead” a bit and leveraged newer capabilities – so if you want to build off the master branch, you’ll need the swift-3.1 toolchain, or the nightly snapshot release, to do the lifting. The current 3.0.2 release won’t do the trick. The nightly snapshots are NOT released versions, so there is some measure of risk and potential breakage – but it has been pretty good for me so far, and it’s necessary for working on SwiftPM.

On to the command! To build and test SwiftPM from a local copy of the source:

docker run -t --rm -v $(pwd):/data:rw -w /data swiftpm-docker-1604 \
/bin/bash -c "./Utilities/bootstrap && .build/debug/swift-test --parallel"

To break this down, since Docker’s somewhat insane about command-line options:

  • -t indicates to allocate a TTY for this process
  • --rm makes the results of what we do ephemeral, as opposed to updating the image
  • -v $(pwd):/data:rw is the local volume mount that makes the current local directory appear as /data within the image, and makes it Read/Write
  • -w /data leverages that volume mount to make /data the current working directory for any commands
  • swiftpm-docker-1604 is the name of the docker image that I make and update as needed
  • /bin/bash -c "..." is how I pass in multiple commands to be run, since I want to first run through the bootstrap, but then shift over to using .build/debug/swift-test --parallel for a little more speed (and a lot less output) in running the tests.

The /bin/bash -c "..." bits could easily be replaced with ./Utilities/bootstrap test to the same end effect, but it's a touch slower in overall execution time.

When this is done, it leaves the "build for Linux" executables (and intermediate files) in the .build directory, so if you try and run it locally on a Mac, it won’t be happy. To be "safe", I recommend you clear the .build directory and rebuild locally if you want to test on macOS. It just so happens that ./Utilities/bootstrap clean does exactly that.
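So the round trip back to a native macOS build looks like:

# clear out the Linux build artifacts, then rebuild and test natively on the Mac
./Utilities/bootstrap clean
./Utilities/bootstrap test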

A few numbers on how (roughly) long this takes on my mid-2012 MacBook Air:

  • example command above: 5 minutes 15.887 seconds
  • using /bin/bash -c "./Utilities/bootstrap test": 6 minutes, 15.853 seconds
  • a build using the same toolchain, but just on macOS locally: 4 minutes 13.558 seconds