Before I get into the content, let me preface it by saying this is a WORK IN PROGRESS. I have been contributing to and collaborating on open source for quite a while now, and my recent work focuses on helping people collaborate using open source, as well as on how to host and build open source projects.
I am a firm believer in “measure what you want to do” – to the point that even indirect measures can be helpful indicators. Most of this article isn’t something you want to react to on a day-to-day basis – it’s about strategic health and engagement. A lot of the information takes time to gather, and even more time to show useful trends.
Some of this may be blindingly obvious, and hopefully a few ideas (or more) will be useful. These are nuggets I’ve built up to help understand the “success” of open source projects I am involved with or running. When I was establishing my last few projects, I did a number of Google searches for similar topics – there weren’t many articles or hints like this out there, so hopefully this is a little “give back”.
There are two different – related but very separate – ways to look at your project in terms of “success”: consumption and contribution.
Consumption is focused on “are folks effectively using this project?”. The really awesome news in this space is that there’s been a lot of thinking about these kinds of measures – it’s the same question anyone managing a product has been asking, so the same techniques for measuring product success can be used.
If your project is a software library, you will likely want to leverage the public package management tooling that exists for your language of choice: PyPI, CPAN, npm, Swift Package Catalog, Go Search, etc. The trend among these package managers is toward relying on GitHub, Bitbucket, and GitLab as the hosting location for the underlying data.
Some of the package indices provide stats – npm, for example, tracks downloads through its registry for the past day, week, and month. A number of the package indices also have some measure of popularity, where they try to relate consumption to some metric (examples: http://pypi-ranking.info/alltime, npm keeps track of how many dependents and stars a package has, and Go Search tracks the number of stars on the underlying repo).
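To make that concrete, npm’s public downloads endpoint can be queried directly. Here is a minimal Python sketch against the api.npmjs.org endpoint; the package name and period in the usage comment are just placeholders:

```python
import json
import urllib.request

NPM_DOWNLOADS_API = "https://api.npmjs.org/downloads/point"

def npm_downloads_url(package, period="last-month"):
    """Build the public npm downloads endpoint URL for a package.
    Valid periods include "last-day", "last-week", and "last-month"."""
    return f"{NPM_DOWNLOADS_API}/{period}/{package}"

def fetch_downloads(package, period="last-month"):
    """Fetch the download count; the endpoint returns JSON like
    {"downloads": 12345, "start": "...", "end": "...", "package": "..."}."""
    with urllib.request.urlopen(npm_downloads_url(package, period)) as resp:
        return json.load(resp)["downloads"]

# Example (requires network access):
#   fetch_downloads("express", "last-week")
```

Logged weekly, a single number like this is exactly the kind of slow-moving trend line this article is about.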
The underlying source hosting system, like GitHub, often has metrics you can look at – and may have the best detail. GitHub “stars” are a social indicator that’s useful to pay attention to, and any given repository has graphs that provide some basic analytics as well (although you need to be an admin on that repo to see them, under the URI graphs/traffic). GitHub also recently introduced the concept of “dependents” in its graphs – although I’ve not seen it populated with any details as yet. Traffic includes simple line charts for the past two weeks covering:
- Unique Cloners
- Unique Visitors
Documentation hosted on the Internet is another excellent place to get metrics, and it correlates reasonably well with people using your library. If you host your project’s documentation through either GitHub Pages or Read the Docs, you can embed a Google Analytics tracking code to get page views, unique users per month, and other standard website analytics from those sites.
I’ve often wished to know how often people were looking at just the README on the front page of a GitHub repository, without any further setup of docs or GitHub Pages, and Ilya Grigorik made a lovely solution in the form of ga-beacon (available on GitHub). You can host it on Heroku (or your PaaS or IaaS of choice), and it will return an image and pass the traffic along to Google Analytics to capture some of those basic details.
If your project is a more complete service – and especially if you have a UI or a reference implementation with a UI – that offers another possibility for collecting metrics. Libraries in a variety of languages can send data to Google Analytics, and with many web-based projects it’s as easy as embedding the Google Analytics code. This can get into a bit of a grey area about what you’re monitoring and sending back. In the presentation “Consider the Maintainer”, Nadia Eghbal cited a preference toward more information that I completely agree with – if you create a demo/reference download that includes a UI, it seems quite reasonable to track metrics on that reference instance’s UI to learn how many people are using it, and even what they’re using. Don’t confuse that with evidence about how people are using your project, though – stats from a reference UI are far more about people experimenting and exploring.
There is also StackOverflow (or one of the Stack Exchange variants), where your project might be discussed. If your project grows to the level of getting questions and answers on StackOverflow, it’s quite possible to start seeing tags either for your project or for specifics of your project – at least if the project encourages Q&A there. You can get basic stats per tag – “viewed”, “active”, and “editors” – as well as the number of questions with that tag, which can be a reasonable representation of consumption as well.
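Those per-tag numbers can also be pulled programmatically from the Stack Exchange API. A small sketch of building the request and reading the question count – the tag name in the usage note is just a placeholder:

```python
STACK_API = "https://api.stackexchange.com/2.3"

def tag_info_url(tag, site="stackoverflow"):
    """Build the tag-info endpoint URL for a given Stack Exchange site."""
    return f"{STACK_API}/tags/{tag}/info?site={site}"

def question_count(payload):
    """The endpoint returns {"items": [{"name": ..., "count": ...}], ...};
    "count" is the number of questions carrying the tag."""
    items = payload.get("items", [])
    return items[0]["count"] if items else 0

# Fetch tag_info_url("your-project-tag") with any HTTP client that handles
# gzip (the API always compresses responses), parse the JSON body, and pass
# the resulting dict to question_count().
```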
Often well before your project is big enough to warrant a StackOverflow tag, Google will know about people looking for details about your project. Leverage Google Trends to search for your project’s name, and you can even compare it to related projects if you want to see how you’re shaping up against possible competition, although pinning it down by query terms can be a real dark art.
Measuring contribution is a much trickier game than consumption, and often more indirect. The obvious starting point is code contributions back to your project, but there are other aspects to consider: bug reports, code reviews, and conversational help – assisting people through a mailing list, IRC, or a Slack channel. When I was involved with OpenStack, there was a lot of conversation about what it meant to be a contributor and how to acknowledge it (in that case, for the purposes of voting on technical leadership within the project). Out of those conversations, the Stackalytics website was created by Mirantis to report on contributions across quite a large number of dimensions: commits, Blueprints (OpenStack’s feature coordination documents) drafted and completed, emails sent, bugs filed and resolved, reviews, and translations.
Mirantis expanded Stackalytics to cover a number of ancillary projects: Kubernetes, Ansible, Docker, Open vSwitch, and so on. The code for Stackalytics is itself open source, available on GitHub at https://github.com/openstack/stackalytics. I’m not aware of other tooling that provides this same sort of collection and reporting – it looks fairly complex to set up and run, but if you want to track contribution metrics, it is an option.
For my own use, the number of pull requests submitted per week or month has been interesting, as has the number of bug reports submitted per week or month. I found it useful to distinguish between pull requests from the known crew and external pull requests – trying to separate a sense of “new folks” from the existing core. If your project has a fairly rigorous review process, then the time between creation and close (or merge) of a PR can also be interesting. Be a little careful what you imply from it, though, as causality for time delays is really hard to pin down and very dependent on your process for validating a PR.
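Computing time-to-merge is straightforward once you have created/merged timestamp pairs for merged PRs – here hard-coded as made-up sample data, though in practice you would pull them from your hosting platform’s API:

```python
from datetime import datetime
from statistics import median

def hours_to_merge(created_at, merged_at):
    """Both timestamps in GitHub's ISO-8601 form, e.g. "2017-03-01T12:00:00Z"."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(merged_at, fmt) - datetime.strptime(created_at, fmt)
    return delta.total_seconds() / 3600

def median_hours_to_merge(prs):
    """prs is a list of (created_at, merged_at) pairs for merged PRs only."""
    return median(hours_to_merge(c, m) for c, m in prs)

# Sample data (placeholder values, not real PRs):
sample = [
    ("2017-03-01T12:00:00Z", "2017-03-02T12:00:00Z"),  # 24 hours
    ("2017-03-01T00:00:00Z", "2017-03-04T00:00:00Z"),  # 72 hours
    ("2017-03-05T06:00:00Z", "2017-03-05T18:00:00Z"),  # 12 hours
]
print(median_hours_to_merge(sample))  # median of [24, 72, 12] -> 24.0
```

The median is a deliberate choice here: one PR that sat open for months would badly skew an average.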
There’s a grey area for a few metrics that lie somewhere between contribution and consumption – they don’t cleanly assert one or the other, but are useful for tracking the health of a project: mailing lists and chat channels. Consider the number of subscribers to project mailing lists or forums, or the number of people subscribed to a project’s Slack or IRC channel. The conversation interfaces tend to be far more ephemeral – the numbers can vary up and down quite a bit, by minute and hour as much as over days or weeks – but within that ebb and flow you can gather trends of growth or shrinkage. It doesn’t tell you about the quality of the content, but the fact that people are subscribed at all is a useful baseline.
Metrics I Found Most Useful
I’ve thrown out a lot of pieces and parts, so let me also share what I found most useful. In these cases, the metrics were all measured weekly and tracked over time:
- GitHub repo “stars” count
- Number of mailing list/forum subscribers
- Number of chat channel subscribers
- Number of pull requests submitted
- Number of bugs/issues filed
- Number of clones of the repository
- Number of visitors to the repository
- Number of “sessions” for a documentation website
- Number of “users” for a documentation website
- Number of “pageviews” for a documentation website
- Google Trend for project name
I don’t have any software to help with that collection – to date, it has been my own manual collection and analysis.
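Even a manual process benefits from a consistent format for the weekly snapshot. A minimal sketch of appending one row per week to a CSV – the column names and values here are placeholders for whichever metrics you track:

```python
import csv
import os
from datetime import date

FIELDS = ["week", "stars", "subscribers", "pull_requests", "issues"]

def append_snapshot(path, row):
    """Append one weekly metrics row, writing the header the first time."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Record this week's manually gathered numbers (values are placeholders):
append_snapshot("metrics.csv", {
    "week": date.today().isoformat(),
    "stars": 123, "subscribers": 45, "pull_requests": 6, "issues": 7,
})
```

A flat file like this is enough to chart trends in a spreadsheet later, which is really the point of the exercise.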
If you have metrics that you’ve found immensely valuable, I’d love to know about them and what you found useful from them – leave me a comment! I’ll circle back to this after a while and update the list of metrics from feedback.
Right after I posted this, I saw a tweet from Ben Balter hinting at some of the future advances that GitHub is making to help with community management.
And following on that, Bitergia poked me about their efforts to provide software development analytics, and in particular their open source effort to provide software for collecting metrics on what it means to collaborate: GrimoireLab.
I also spotted an interesting algorithm that was used to compare open source AI libraries based on GitHub metrics:
Aggregate Popularity = (30*contrib + 10*issues + 5*forks)*0.001
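In Python, that weighting is a one-liner – the sample numbers below are made up purely to show the arithmetic:

```python
def aggregate_popularity(contrib, issues, forks):
    """The weighting from the formula above: contributors count most,
    then issues, then forks, all scaled down by 0.001."""
    return (30 * contrib + 10 * issues + 5 * forks) * 0.001

# Placeholder inputs: 100 contributors, 50 issues, 200 forks.
print(aggregate_popularity(100, 50, 200))  # (3000 + 500 + 1000) * 0.001 = 4.5
```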