Google Project Zero delivering benefits

Most people won’t know about Google Project Zero – but it’s worth knowing about. I learned about it a few months ago, although the effort was started back in 2014 after the now-infamous Heartbleed security vulnerability. It is an effort, funded and hosted by Google, focused on finding a particularly nasty class of bugs: low-level software exploits. The Wikipedia article on Project Zero is pretty good for the history.

This morning as I was applying a software update, I scanned through the release notes, and quite a number of the set I reviewed were security patches informed by CVEs generated or found through Project Zero. As an effort to support and bolster the lowest level of software infrastructure, I’ve got to applaud it.

micro.blog: Looking forward to a new way to have a conversation

I’ve had my Twitter account since 2007. When I joined Twitter, it was a lovely way to step forward into conversations with a bunch of technologist friends who were mostly Mac (and soon iOS) developers. For me, it was a pipeline to the thoughts of friends and cohorts. It was really similar to the kind of experience I found stopping by the pub at a developer conference: lots of scattered conversations forming, happening, and breaking down again. I described it as “tea party conversations” – where you didn’t need, or even want, to hear everything, but you could get a bit of a feeling for what was being talked about, join in on the conversation, and step back out when you were done.

That lasted a good long while. During that time I used blogs and RSS readers (like NetNewsWire) to keep up with the longer form content. Tutorials, opinion pieces, demo links, whatever – stuff that didn’t fit in a small text-message-style conversation. I was using several devices at once; when the landscape changed and most of the RSS reader ecosystem consolidated around Google Reader, I went with it. I still used NetNewsWire, but synced all the data with Google Reader and also used the Google Reader web application.

Longer form writing started to die down – Tumblr rolled through, lots of folks posted to Facebook where they might previously have written something longer on a blog, and Twitter kept being prolific at surfacing pretty good conversation points. More and more, Twitter became the place where I learned about events, found links to longer form articles to read, and kept track of what was happening across quite a variety of technical ecosystems.

Then in 2013, Google kicked a whole bunch of folks (including me) in the nuts: it shut down Google Reader. They gave a lot of time to transition, they did it well and communicated fairly, but in hindsight that shutdown really collapsed my reliance on RSS as my way of getting to longer form writing. I kept up with Twitter, LinkedIn was doing some technology news sharing, and I wandered and struggled with tracking the news and information I was interested in, but was mostly successful.

In the intervening years, Twitter has arguably become a cesspool (so has Facebook, to be fair). Harassment, fake news, overt advertising, and in both cases the services’ algorithms showing me only portions of what I wanted – what they thought I wanted to hear. It became more of an echo chamber than I’m comfortable with. Where it was once the pub where I could step into a conversation, learn something interesting or hear about something I should check out later, and head on – it became incoherent shouting over crowds.

I intentionally followed a wide variety of people (and companies); I wanted the diversity of viewpoints. With the evolving algorithms, the injection of advertising that looks like my friends’ messages, and generally just the sheer amount of noise, it became overwhelming. The signal-to-noise ratio is really quite terrible. I still get a few nuggets and have the occasional interesting conversation there, but it is a lot more rare. Friends have dropped off Twitter out of frustration with it – harassment issues for some of them – and others have locked down their posts, understandably so.

So when I heard that Manton Reece was taking a stab at this game with micro.blog – getting back to conversations using open-web technologies – I was thrilled. I’ve been quietly following him for years; he’s a level-headed guy with interesting opinions. A good person to listen to in my “tea party conversations” metaphor. Manton has his project up on Kickstarter, and got the response I think it (and he) deserved: resounding. It is fully funded, a couple of times over as I write this. I backed it – as did over 1750 other folks.

Even though it’s funded, if you’re in the same situation I am with Twitter and Facebook, take a look at the micro.blog Kickstarter project and consider chipping in.

I don’t know if it’ll be a path to solving some of the problems I’ve experienced with Twitter, and to a lesser extent with Facebook. I want a place where I can carry on conversations that isn’t tightly bound into one of those companies. I want tooling where the barrier to hearing different opinions and thoughts on topics I’m interested in isn’t so high. I want to “turn down the volume” quite a few notches. I hope that micro.blog does some of that.

I’m fairly confident that the people I started following on Twitter back in 2007 will be on micro.blog, sharing their opinions (and blogs). I’ll be on it too, sharing my thoughts and opinions, and hopefully engaging in some interesting conversations.

Kubernetes closing on full loop performance automation

Some background before I get into what I see the Kubernetes community doing.

When I say “Full loop performance automation”, I am talking about scaling up and down services and resources in response to application metrics – often response time, but many different dimensions of metrics are interesting for useful solutions. It is a sort of holy grail for modern applications; it’s not easy, and it is far from consistent or common. And although I’m talking about Kubernetes, this concept is applicable to more than cloud-based services.

it’s a control loop

At its heart, this is a feedback control loop. To accomplish this kind of thing (there’s a minimal sketch of the loop after this list):

  • You need data from your application to react against: API response times or queue lengths are some of the most common – the best are fairly direct measures linked to the improvement you’d get from making additional resources available.
  • You need to capture that data, baseline it, and have some discriminator that will turn it into choices and action – adding a few more instances of the front end to get better performance, or the inverse: nobody is hitting this now, so reduce the number of instances down to a minimum. Most operations work so far has been obsessed with the first, although the second is arguably more valuable for controlling costs.
  • You need a means of wrapping that choice – that discrimination – into an action or a control of some form: adding more instances, reducing instances, making relevant system connections and configurations, etc.
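
To make that concrete, here’s a minimal sketch of the discriminator piece of such a loop. The names here (ScalingPolicy, reconcile, the thresholds) are purely illustrative – they are not any particular scheduler’s API:

// A minimal, hypothetical feedback control step: given a policy and an
// observed metric, decide how many instances we should be running.
import Foundation

struct ScalingPolicy {
    let service: String
    let targetResponseTime: TimeInterval  // e.g. 0.2 seconds
    let minReplicas: Int
    let maxReplicas: Int
}

func reconcile(policy: ScalingPolicy, currentReplicas: Int, observedResponseTime: TimeInterval) -> Int {
    if observedResponseTime > policy.targetResponseTime {
        // Slower than the target: add capacity, up to the maximum.
        return min(currentReplicas + 1, policy.maxReplicas)
    } else if observedResponseTime < policy.targetResponseTime / 2 {
        // Plenty of headroom: shed capacity, down to the minimum.
        return max(currentReplicas - 1, policy.minReplicas)
    }
    return currentReplicas  // within tolerance: do nothing
}

let policy = ScalingPolicy(service: "frontend", targetResponseTime: 0.2, minReplicas: 2, maxReplicas: 10)
let next = reconcile(policy: policy, currentReplicas: 4, observedResponseTime: 0.35)
// next == 5 – response time is over target, so one more instance is requested

Run something like that on a schedule and wire its output into whatever actually creates or destroys instances, and you have the skeleton of the loop.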

This has been feasible for a while, but it’s not been easy. It is dependent on what you can control, often through APIs. There are also limits to the feedback loops; sooner or later you bottleneck, depending on what you’re scaling and its dependencies. If what you need to scale is not accommodated in your control loop, you can make changes but they’ll have no effect. This sets limits on what you can scale and how far you can scale it. Prior to the world of “cloud computing” and IaaS, this was often the number of physical servers you had available, or possibly the memory in those servers, or the amount of underlying raw IO that those servers had to other storage systems. I have had a lot of conversations about “IO budgets”, “memory budgets”, and more as key factors in building highly responsive and highly scaled systems.

what we can control through APIs has expanded

With VMware slicing up physical machines into virtual machines, and then AWS taking the ball and running with it to create cloud services, the thing that could be most easily automated was a virtual machine. Automated configuration of virtual machines often followed the pattern of physical machines. In general, the pattern has been assigning a single purpose and not changing it.

In fact, the most common devops tools for VM and bare metal configuration are almost antithetical to the concept of “scaling up and down”, at least within a virtual machine. Puppet and Chef in particular are based on the concept of “converging to desired state” – state being singular and unchanging; well, until you change the Puppet or Chef configurations anyway. Not reactive, not responsive – more a “nailed down” thing. Puppet and Chef both kept the boundary of their effects to a single OS instance, most commonly a VM today. And frankly, they slept through the realization – embodied by AWS CloudFormation, its open source clone OpenStack Heat, Juju, and lately Terraform – that the need for management had gone well above individual virtual machines and into sets of them.

These practices all stem from the ITIL model, an overly simplistic synopsis of which is “reducing errors by controlling change”. Originally formed when everything was manually updated and changed, it’s a duct-tape bandage on making a reliable system with us – humans – in the loop, human nature not being entirely reliable, even more so when we don’t communicate well (or at all) with our peers. The result was recommendations for formalized communication patterns, some of which ultimately became codified by software (originally BladeLogic) and then its open source clones/competition: CFEngine, Chef, and Puppet.

As applications grew to span and require multiple machines (or multiple VMs), scaling became a question of how fast you could spin up and add new servers. Ops folks have wanted to automate this scale-up and scale-down for ages, but never had consistent mechanisms to do so. It is not surprising that one of the first features Amazon enabled after basic EC2 VM creation and teardown was autoscaling.

there are some early steps towards this level of automation

Amazon’s auto-scaling setups (and several other systems out there) have enabled simple versions of this concept. It’s been applied extremely successfully where it’s easy – stateless services like front-end web servers – and what you scale is more (or fewer) virtual machines. Which is great, but often not granular enough for really good control.

challenges include humans in the process

One of the problem points is the consistency of the decision-making process. These operations, as they were codified, almost always used humans as the control points. In addition, they were rarely tested as completely as they needed to be, and were completely beholden to the specific installation of the underlying application. Access to create (and destroy) virtual machines (and related resources) is tightly controlled. You can see these controls parroted in the autoscaling systems today: Juju, Heroku, and Cloud Foundry all expose scaling as operator commands. A lot of the reason is cost – scaling up means increasing costs, and companies almost always put a person in that control loop.

As the cost of computing continues to fall, the cost of keeping people in these loops increases. Humans are slow: minutes (or sometimes hours, if they’re being woken from sleep and getting remote access) to respond to an alert from the monitoring infrastructure, make a choice, and invoke the relevant commands. IBM recognized this years ago and started to promote fully autonomic services. The concept of full loop performance controls has been lurking in the corner, desired by many in SRE and operations focused jobs. As we shift the responsibility for running the services you write onto developers, that desire is also spreading to all levels of development teams.

and the granularity of what we can automate and scale

The other problem, which I hinted at earlier, has been the granularity of what you can control. A bare metal machine, or even a VM, is too large a unit – too coarse. With the shift towards containers as a means of deployment, the granularity is hitting a much better spot: per OS process, not per VM. Adding more worker processes, or removing some – and the concept of grouping them by service – is about the right level of granularity for these control loops.

The container ecosystem also supports scaling as an operator command: Docker, DC/OS/Marathon, and Kubernetes all mirror quite a bit of the AWS autoscaling concepts. Marathon and Kubernetes offer autoscaling APIs, and there are companies trying to build a bit of autoscaling onto Docker.

You can scale based on CPU consumption readings or requests per second from a load balancer. That often does well for stateless front-end web servers, but you often want to scale on other measures – like adding temporary cache capacity, or manipulating the number of asynchronous background process workers.

what kubernetes is doing

Kubernetes has a solid architecture for metrics, and in general the project has been very aggressive about getting all the required primitive pieces in place for truly amazing levels of automation. The Kubernetes community has been pretty clear about how they want metrics to operate. Over the holidays, I saw an API proposal that supported custom metrics, which really brought it home for me.

This still requires that we build something into our applications to expose the metrics we want to react to, and we’ll then need to expose those through the service (or some Kubernetes object) in order to react to them. Maybe that’s a variant of a healthcheck built into your application (the healthz concept Kelsey Hightower has been presenting), or a sidecar container that “bolts on” a level of monitoring to previously built applications and exposes your selected application metrics.
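
As a sketch of what the application side could look like, the work is mostly deciding which numbers to publish and serializing them somewhere a healthz-style endpoint or a sidecar can pick them up. The AppMetrics struct and its field names below are invented for illustration:

import Foundation

// Purely illustrative: a snapshot of application metrics that a healthz/metricz
// style endpoint, or a sidecar container, could serve as JSON for a control
// loop to consume.
struct AppMetrics: Codable {
    let requestsPerSecond: Double
    let p95ResponseTimeMs: Double
    let queueDepth: Int
}

func metricsPayload(_ metrics: AppMetrics) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = .prettyPrinted
    return try encoder.encode(metrics)
}

let snapshot = AppMetrics(requestsPerSecond: 42.0, p95ResponseTimeMs: 180.0, queueDepth: 7)
if let body = try? metricsPayload(snapshot), let json = String(data: body, encoding: .utf8) {
    print(json)  // whatever serves your healthcheck could return this body instead
}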

Once a means to collect custom metrics is in place, something needs to make the decisions – and the structure for that is already solidly in Kubernetes. The service and controller concepts within Kubernetes provide a good basis for implementing these control loops; they just need more (and different) information.

all this enables more complex optimizations for running your applications

Once this is enabled, new information can be factored into the decision process and become part of the scheduling and balancing setup: balancing multiple services and caches against a cost to run your system; pre-scaling or caching against predicted consumption that would otherwise bottleneck your application; or more complex tradeoffs between services that need to be responsive right now vs. scaling up background processes to handle work at low-usage (or low-cost) times. There are more possible business and use cases here, but they all require similar things – more and different information from the application itself, as well as information from the environment hosting the application.

Mesosphere, with Marathon and DC/OS, is pushing on some of these same points, but so far hasn’t included de facto standards or means of piping data from an application into their system to make this feedback loop easy. It’s possible, but it’s a bolt-on to Marathon. I wouldn’t be surprised to see Docker put something roughly equivalent into its APIs, extending its concept of runtime metrics, especially if things like the Microscaling effort go well. With Docker’s apparent effort to be everything to everyone, I expect they will jump on that bandwagon as soon as they see someone else doing it and being successful.

I’m looking forward to seeing how this evolves, keeping a close eye on it all. With systems like Kubernetes, this could apply to your local datacenter or to any number of cloud providers, and it’s ultimately about helping us run our applications and services more efficiently and effectively.

Distributed home computing

I first saw the Ars Technica article on the Intel Compute Card announced at CES this year. It was incredibly skimpy on details of course, with “more coming mid-2017”, but I nonetheless read all the articles on it I could find: Gizmodo (good pics and usefully factual), TechRepublic (quite a bit more IoT-hyperbolic and click-baity), Liliputing (more good pics and sparse, but good details), and a few others that were mostly regurgitation without any value add.

Intel’s Fact Sheet on the Compute Card was a bit trickier to dig up, but useful. Of course, I want to have some sense of the tech specs now – how much compute, RAM, and IO, what kinds of ports it will integrate with, how much power it will use, all the nitty gritty details.

It really looks like Intel is making a play for the “replaceable IoT computing unit” based on some of the pictures, but honestly the first thing I thought of when I saw it was: who’s going to make a “dock” that you can slot 10, 20, or 30 of these things into?

Make them glowing white glass and the scene of deactivating HAL from 2001 comes immediately to mind.


It’s wacky-out-there, but a distributed computing backplane of these isn’t entirely insane. We’ll have to see how the economics bear out, and it’s not like there’s an “easy interface” for adding applications into this kind of space like you would with a Mac, iOS, Android, or Windows device – but it’s not out of the realm of possibility. Folks are doing the rough equivalent with Raspberry Pis, which are probably in the “slightly less capable, and definitely less expensive” realm compared to the Compute Cards. Canonical did the same kind of clustered thing with the Orange Box, which I’m guessing will be closer to the price point of the same number of cards.

There are so many reasons why this wouldn’t be a home computing solution, although I’d love to have a mechanism where compute resources could be extended by plugging in a few more modules, and the whole setup operates as a pool of computing resources. For now I’m going to stick with my small stack of NUCs and a cheap Netgear gigabit switch.

Programming puzzles with Swift Playgrounds

Recent travel and the holidays gave me ample opportunity to dig into some programming puzzles that I’ve been meaning to try – in particular, to really walk through the Swift language and programming tutorials using the Swift Playgrounds application (specifically on my iPad).


The content in the three Learn to Code playgrounds is incredibly well done. The sequences build on each other, refresh and play back what you’ve learned, and above all are fun. Even though I was already very familiar with the concepts, I enjoyed doing each of the exercises in the tutorials simply because they were interesting, sometimes quite amusing, but mostly because they were fun!

The fact that it was available on an iPad was a definite bonus, letting me poke and prod at experiments and ideas while I was on a plane, or had some downtime for an hour. It was super easy to pick up, set back down, and then pick it back up a day or two later with no real loss of context or understanding.

A notable downside that I hit was that some of the help and documentation wasn’t available directly within the application (or apparently wasn’t cached by the application) and required the Internet and a browser to access.

For a newcomer to Swift (or to programming), when you do something egregious in Playgrounds, the error messages can be exceptionally strange. That’s a place where you might be pretty confused, simply because the error messages don’t make a tremendous amount of sense. For example, if you torture a for loop with a typo:

for 1 in 1...3 {
    counter += 1
}

The first error message that appears in a list of quite a few is “Expected Pattern”, which honestly means incredibly little even to me. The most useful message is the last one in the set: “Type ‘Int’ does not conform to protocol ‘Sequence’”, and even that assumes quite a bit of knowledge – like what a type is, what a protocol is, and that the 1...3 thing is leveraging a specific protocol called ‘Sequence’.
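
For what it’s worth, the fix is trivial once you can decode what the compiler is asking for – the loop just needs an actual pattern (a variable name, or the wildcard _) where the stray 1 was:

var counter = 0
for _ in 1...3 {  // "_" (or a name like "i") is a valid pattern; the literal 1 is not
    counter += 1
}
// counter is now 3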

The whole concept of Playgrounds is very familiar to me from the time I spent in the Python community learning and using IPython notebooks (now more often called Jupyter notebooks, since they’re no longer bound just to the Python language). The user interface metaphor is a sort of magic notebook where you can mix writing and programming freely; the first example I ever remember seeing was Mathematica in the computer labs at the University. These days, I see all sorts of scientific papers that leverage this technology as a sort of “live and reproducible programming lab notebook”.

Swift Playgrounds continues with this pattern, limited to the Swift language, but still flexible enough that I can see some interesting capabilities for learning and teaching based on it. I haven’t yet really dug into how to create playground content, certainly not to the amazing level that the Learn to Code playgrounds have enabled and embraced. I suspect it could make some really incredible walk-through API tutorials possible, which I’m looking forward to poking into more down the road.

In the meantime, I’ve got more puzzles to solve in the existing content…

UPDATE: Apple has the video showing off the Swift Playgrounds application available through their WWDC Videos, as well as a video from last year’s WWDC on how to author rich content for Playgrounds.

China, October 2016

I’ve been quiet on this blog for a few weeks, mostly due to travel. I recently completed a wonderful trip to China, a tour we’ve been planning for the better part of several years. I’m sure I’ll get back to technical topics but for now I wanted to put up some of the photos from our amazing trip.

Between Karen, my mother, and myself, we took nearly 1500 photos. Picking the iconic or even relevant photos from that set of memories… well, just wow. That’s tough. Fair warning, this will be almost all photos…

Obligatory selfie with my love, at the Temple of Heaven in Beijing

I’ve had a side interest in the art of stone carving for a long time, and I definitely got to enjoy some exquisite examples while I was at the Temple of Heaven.



That extends into architectural details and ornamentation.


The details that we saw are innumerable. Just as representative of the trip – perhaps more so – was the sense of vastness and scale. Texans may think they “build big”, but really, they’re pretty weak punks when it comes to China.

Panorama from the Great Wall of China

 

Panorama from the Forbidden City

Panorama from Xi’an and the archaeological dig where they are unearthing the terra cotta warriors.

But of all the places, I think the most visually stunning wasn’t built.

Panorama from Guangxi, China, on the Li River – what’s often called “the gumdrop mountains” in English.

One blog post would never do it justice: 12 days in a whirlwind tour, touching down in quite a few different parts of a hugely diverse country.

We had some goofy fun times, such as riding the maglev in Shanghai, and my oh-so-authentic terror of being on a train moving at 431km/hr (it was really cool!)


Traveling with our fellow tour mates and getting to know other people with interest in China as well.

This is Karen and me with David – the youngest member of our tour – at the dig in Xi’an.

Being with family to have this experience (Karen, myself, and mom)


Meeting new people who were willing and interested to chat with curious tourists.

Including quite a number of school children who were all very excited to practice a little English and pleasantries.

Quite an amazing trip, walking through that 大門 to see China.


Wallowing in research papers

I started my first job at the University from which I graduated. It was a strange time: a light economic depression, and “peace” had just broken out. The Cold War was effectively over, the Berlin Wall had come down, and it wouldn’t be long until the Soviet Union dissolved into its component states. I had excelled in my studies of CMOS chip design, but never used that skill in any job. Instead, what was available was software and system administration. An evening gig at a campus “help desk” is where I started. In that environment, I had access to clusters of NeXT computers, SGI Toasters with IRIX, some standalone IBM RS/6000s running AIX, a VM/CMS mainframe, and rooms and rooms of PCs and Macs. You couldn’t ask for a more fertile ground for learning.

One of the CS professors doubled as an early job mentor (as well as being “the security guy” for the University): Greg Johnson. When I was first stepping into Unix and trying to get a handle on it, he provided some sage advice that stuck with me:

Reading man pages is a skill. It’s something you can practice at and learn, and there is a treasure trove of information there.

For those who don’t know, man (short for manual) pages were (and are) the built-in documentation for Unix – accessible using the command man – covering everything from how to use commands on the command line to details about how various software libraries function. You can easily find them – any Linux OS probably has them installed, as do Macs. Just open a terminal or SSH session and use a command like “man ls” and you’ll get a gist of what’s there – as well as why it’s a skill to read them.

I took Greg’s advice to heart and focused on learning to read man pages. It may have been the moral equivalent of learning to read by going through a dictionary, but it worked. I wallowed in man pages, kept following down leads and terms referenced in that green-screen text output. It started as a confused mess of words that didn’t mean anything. After seeing them repeatedly, and wrapping some context around them, I was able to sort out what various terms meant.

Twenty-five years later, I find myself learning using the same process. A different set of confusing words, and a different topic to learn, but things are coming into focus. Over the past six months (well, since DeepMind won the Go games over Lee Sedol) I’ve been wanting to understand machine learning and the current state of research. The topic is so new, so rapidly unfolding, that other forms of documentation are pretty limited. There are articles and how-tos out there for Keras, TensorFlow, and others – but understanding the topic is different than learning how to use the tools.

Wallowing in academic research papers is as messy as the man pages were in the “unix wars” era. The source these days is arXiv, which hosts scientific papers in prepublication form as a common repository for academic research and citation. They have a handy web interface that lets you list the papers from the past week:

https://arxiv.org/list/cs.AI/pastweek

Every weekend, I refresh that page in a browser, browse through the titles, and grab and stash PDFs into Evernote for later reading. Every now and then, I find a “survey of…” paper, which is a gold mine of references. From there, either arXiv or Google Scholar can help speed up finding those additional papers or areas of interest. Wikipedia has been a godsend in turning some of the jargonistic terms into plain English. An AI issue such as the vanishing gradient problem gets a lot more tractable on Wikipedia, where volunteers explain the issue much more simply than the strict correctness of academic papers.

I hit one of those “magic” papers just last weekend. Entitled “A growing long-term episodic & semantic memory”, it was one of the papers that stitched together a lot of concepts that had been loosely floating around in my head.

I don’t understand it sufficiently to explain it simply, but I’m getting there…

Embodied AI

I’ve thought for quite a while that some of the critical paths forward in building systems that learn need more complexity and variables in their world. A couple of weekends ago a friend passed me an article on Engadget about an AI contest set in Doom. Given the recent exemplary advances of systems learning to play Nintendo games, it’s a natural successor – that, and I think there’s a lot of social overlap between people interested in building AI systems and those who enjoy playing video games. Puzzle solvers all, on various levels.

Part of me is thrilled that we’re taking initial stabs and making challenges to extend the state of the art. Another part of me is thinking “Oh great, as if this isn’t going to spread the Terminator existential fear.” I don’t really have any fear of the Terminator scenario. Like some others, I think that’s at best a very distant concern in the advancement of AI, and far more driven by PR and media than by the real issues.

I do have a deep and abiding concern about “bad actors” with the augmentation of AI at their fingertips. For now and the near future, AI is being applied as an augmentation onto people (or companies) – and their abuses of those resources could easily prove (and I think are far more likely) to be more damaging.

Although we’re farther away from creating a general artificial intelligence than many expect (and, in a long-term perspective, it will still happen sooner than we expect), I suspect the way it will learn will be nearly the same way we do – partially by being taught, and partially from experience. Today’s AI systems are the moral equivalent of mockingbirds – good at repeating something that already exists, but with limited capabilities in applying innovation, or dealing with “surprise” events and learning from them.

I think a far more interesting effort in this space – not quite as click-bait worthy as AIs playing Doom – is what Microsoft is calling Project Malmo: some mechanisms to start to provide a bit of embodiment in a world more simplified than our own, Minecraft. I haven’t seen much press about the results, contests, etc. from opening up Project Malmo, but the PR and blog articles from earlier this summer point to where anyone can get started, from their open code on GitHub. (I’m still waiting for the Ballmer-esque world to come and take Microsoft back into the bad ole days and away from their renewed sanity around using and contributing to open source, but it seems to be sticking.) I’d love to see some challenges built on Malmo, with goals different than getting the highest kill count.

Can we versus should we

I made a conscious choice to use the phrase “should we…” when asking a question about technical implications rather than “can we…”. The difference is subtle and very significant.

There is a lot of self-pride in the engineering community around being able to solve the problem. Using probably too hasty a generalization, culturally it’s a bundle of people who love to solve puzzles. But as we’ve been solving problems, some of the side effects and impacts of those solutions are really about the organization we’re affecting, or the people consuming our tooling.

I mostly work on open source related tooling and projects these days: back-end or server-side code, distributed systems, micro-services if you want to use the buzzword bingo. In the end though, it’s all about making tools – and tools are for people. Michael Johnson, who makes tools for the folks at Pixar, once offered me a bit of advice when I was at Disney making tools for their back-end developers. I’ll summarize what I remember, because the specific words are lost to my memory:

Pay attention to what the tool is doing as much as the technical work of doing it. Tools will shift the balance of an organization and how it works significantly.

That bit of advice has really stuck with me, and is something I think about when setting out on a project or even discussing adding some capability. What is this enabling? Who is it helping? How will they use it? What impact will it have?

It’s very easy not to look beyond the immediate impact of what a tool will enable, but looking further is critical to making something really effective. And that’s why I try to ask the question “should we” – to set up the context for the discussion of not only what a feature in a tool can do, but what it means in a broader sense. In the end, it’s really just another puzzle – so getting a diverse set of brains working on possible solutions usually ends up with some good solutions, and surprising insights. It’s really a matter of getting the right questions framed, and then getting out of the way to see what ideas come up.

Bugs: It’s a data problem

A number of years ago, a friend complained to me “Why does everyone seem to want to reinvent the bug tracker?!” That semi-inebriated conversation stayed with me as a question that has since rolled around in the back of my head.

A couple of years ago at Nebula, I was actively triaging bugs for a weekly development meeting. The number of issues logged was up around 250. It wasn’t just bugs – big and small feature ideas were included, as were the “hey, we took a shortcut here to get this out the door for a customer, let’s get back here and clean this up” items. It was the whole mess of “tracking development” stuff, freely intermixed with what was far more appropriate for product management to manage. I was guiding the engineering effort, so I wallowed in this data all the time – I knew it inside and out. The reason that day stood out is that I had just hit my threshold for getting overwhelmed with the information.

It was at that point that I realized that bugs, development ideas, and tracking technical debt were fundamentally a data problem, and I needed to treat them as such to have any semblance of control in using the data. Summarization and aggregation were critical, or we’d never be able to make any reasonable decisions about what should be focused on. And by reasonable I mean somewhat objective, because up until then everyone was counting on me to know all this information and provide the summary – and I was doing that intuitively and subjectively.

I call them “bugs” because that’s easy and I’m kind of lazy, and it’s kind of a cultural norm in software engineering. The software that has been most successful at wrangling this kind of data flow has generally been called “bug trackers”: Bugzilla from back in the day, Trac, FogBugz, JIRA, or more often these days GitHub issues. “Bugs” is not technically accurate; it’s all the ephemera from development, and it starts to get intermixed with “hey, I just want to track and make sure we do this” stuff that may not even be about the product. Once you put a tracking system in place that people actually use, it gets used for all sorts of crap.

It gets more confusing because some of the data tracked really is bug reports. Bug reports have narrative included – or at least I hope narrative exists. And free-form narrative makes it a complete pain in the butt to aggregate, summarize, and simplify.

I think that is why many folks start adding tags or summarization metadata to these lists. The most common are the priority tags – “really important” vs. “okay to do later” markers. Then there is the constant tension between fixing things and adding new things that tends to encourage some teams to use two different, but confusingly similar, markers like “severity” vs. “priority”. In the end, neither does a particularly good job of explaining the real complexities of the collection of issues. At this point, I would personally rather just make the whole damn thing a stack rank, factoring in all the relevant pieces, rather than trying to view it along two, three, or more axes of someone else’s bucketing. In practice, it usually ends up getting bucketed – and who (individual or group) does the bucketing can be very apparent when you look through the data.

This is all fine while you’ve got “small amounts” of data – small amounts in my case being up to about 250 reports. The exact number will vary with whoever is trying to use it. So far, I have been writing about being overwhelmed by the volume of this information. Growth over time caused most of that volume – there is always more work than you will ever be able to do. If you accelerate that flow by collecting all the crashes from remote programs – on desktops, or in browsers, or on mobile devices – the volume will grow tremendously. Even with a smaller “human” flow, there is a point where just getting narrative and data summarized falls apart and you need something other than a human being reading and collating to summarize it all.

A few years ago Mozilla started some really amazing work on dealing with the aggregate collection of crash data and getting it summarized. They were tracking inputs on a scale that nobody can handle individually. They built software specifically to collect tracebacks and error messages that they automatically aggregate and summarize. They also set up some thoughtful process around how to handle the information. Variations on that theme have been productized into software solutions such as Crashlytics (now part of Fabric), Rollbar, or Sentry.

Like its cousin, the performance metrics commonly collected on hosted services (with something like Grafana or Graphite), this is informational gold. This is the detail that you want to be able to react to.

It is easy to confuse what you want to react to with how you want to react. It’s common when you’re growing a team; you start mapping process around the data. When you’re dealing with small numbers of bugs, it is common to think of each report as an exemplar – where duplicates will be closed if so reported – and you start to track not just what the issue is, but what you’re doing to fix it. (Truth is, as the volume of reports grows, you almost never get any good process to handle closing duplicates.)

At first it’s easy – it is either “an open issue” or it is “fixed”. Then maybe you want to distinguish between “fixed in your branch” and the fix included within your release. Pretty soon, you will find yourself wandering in the dead marshes trying to distinguish between all the steps in your development process – for nearly no value at all. You are no longer tracking information about the issue, you’re tracking how you’re reacting to the issue. It’s a sort of development will-o’-the-wisp, and not the nicer form from Brave, either.

The depths of this mire can be seen while watching well-meaning folks tweak Jira workflows around bugs. Jira allows massive customization and frankly encourages you down this merry path. I’m speaking from experience – I have managed Jira and its siblings in this space, made more than a few swamps, and drained a few as well.

Data about the bugs won’t change very much, but your development process will. When you put that process straight into the customization of a tool, you have just added a ton of overhead to keep it in sync with what is a dynamic, very human process. Unless you want to be in the business of writing the tools, I recommend you use the tools without customization and choose what you use based on what it will give you “out of the box”.

On top of that, keep the intentions separate: track the development process and what’s being done separately from feature requests and issues for a project. Maybe it’s obvious in hindsight, but as a data problem, it gets a lot more tractable once you’re clear on what you’re after. Track and summarize the data about your project, and track and summarize what you’re doing about it (your dev process) separately.

It turns out that this kind of thing maps beautifully to support and customer relationship tracking tools as well. And yeah, I’ve installed, configured, and used a few of those – Helpdesk, Remedy, Zendesk, etc. The funny thing is, it’s really about the same kind of problem – collecting data and summarizing it so you can improve what you’re doing based on feedback – the informational gold that tells you where to improve.