Category Archives: Reproducible Research

The Bit-Bucket Challenge

The ALS Ice Bucket Challenge is sweeping the internet, and people are busy coming up with crafty and audacious ways to dump icy water on their heads. I’d like to propose a lesser challenge, one that would help people who work with data everywhere. I call it the bit-bucket challenge. In the words of Godard:

it’s not important where you steal from, it’s where you bring it to

So, here’s my proposal:

ML researchers of the world: please, please, please release your code under a permissive license (BSD, MIT, or maybe LGPL).

No need to make a video, no need to agonize over the fine line between slacktivism, narcissism, and do-goodery. None of that. This is *much* easier. Here’s how to do it:

  1. Sign up on GitHub, Bitbucket, or SourceForge, or make your own webpage. Just find somewhere to host your project.
  2. Take the readme.txt from your project’s main working directory (write one now if you haven’t already) and convert it to Markdown. It should tell people how to build your code and note any obvious stumbling points.
  3. Upload all of that onto the internet (see step 1).
  4. Make a blog post or a tweet briefly describing what you’ve just posted.
  5. Tag three colleagues in your post, challenging them to release the code for a project they’ve published as a paper.

That’s it.
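For the git-inclined, steps 1–3 boil down to a handful of commands. This is only a sketch; the project name, commit message, and remote URL below are all made up:

```shell
# Create a local repo with a README (project name is hypothetical).
mkdir my-algorithm && cd my-algorithm
echo "# my-algorithm: code accompanying our paper" > README.md
git init -q
git add README.md
git -c user.name="You" -c user.email="you@example.com" \
    commit -q -m "Initial public release"
# Then point it at a remote you created on your host of choice and push:
# git remote add origin https://github.com/yourname/my-algorithm.git
# git push -u origin master
```

The commented-out lines at the end are the only host-specific part; everything before them works the same whether you end up on GitHub, Bitbucket, or SourceForge.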

Is your code ugly? Full of hacks? Hard-coded values? Bubble sort? Do not let this deter you; if your code is the least bit functional, someone who has read your paper and would like to try your algorithm will take that mess and improve it. They will patch it, refactor it, or make other radical changes to suit their purpose. Chances are good that they’ll share these improvements with you, gratis! Heck, they might even write new tests for your code (you have some tests, right?). But they can’t do any of that unless you give them the chance.

Why is this a worthwhile challenge? Well, re-implementing a scientific paper is hard. Also, software is now firmly entrenched in the scientific process. If you’re doing computational method development and just writing about your results, I don’t think you’re doing it properly.

“if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice)

(h/t to Gael Varoquaux)

Worried that no one will care about your code? You’re in good company. There are plenty of projects out there not used by anyone other than the author, or algorithms of marginal value. Believe me, I’ve written plenty. But on the off-chance that someone *does* find your code useful, you’ve just saved this person many, many hours of work in trying to re-implement your algorithm from scratch.

So please, dear friends, let’s all share some code. Just do it. Do it for the betterment of us all.


A quick and useful note for organizing comp bio projects

I just happened upon William Noble’s guide to organizing computational biology projects. There are many good practices discussed therein, especially if you don’t have much experience with structuring a project yourself. He touches on a few important points, such as keeping a readme.txt file in every directory to chart your progress through the project, the virtues of version control, and the separation of code from data and results.

Personally, I would like to push as many of a project’s bookkeeping aspects (project status, per-directory readme.txt files, lab notebooks) as possible onto version control. Why? Because everything is then more easily accessible, stored in one place, and can be used to generate any of the reports or views that a readme.txt or lab notebook provides. Running an experiment? That’s a private GitHub gist with a link to your lab notebook entry (which can itself be a private blog hosted by GitHub). This way, one tool synchronizes the related parts of your project, so your readme.txt files don’t fall out of date (or out of sync with your code, results, and documentation). It’s easy to share with collaborators, and best of all, you don’t have to maintain the infrastructure.

FYI: if you use git and do not want .pyc / .o / .a / … files cluttering your repo, set up a .gitignore file.
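A minimal .gitignore covering the file types above might look like the following (written here with a heredoc for copy-pasting; adjust the patterns to your own build):

```shell
cat > .gitignore <<'EOF'
# compiled Python bytecode
*.pyc
__pycache__/
# compiled object files and static libraries
*.o
*.a
# editor backup files
*~
EOF
```

Once committed, git will silently skip anything matching these patterns in `git status` and `git add`.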

The nature of things

Yesterday in our statistics journal club we began reviewing a paper on Dirichlet Processes. The Dirichlet process is a (somewhat complicated) stochastic process used in Bayesian nonparametric models of data. Dirichlet process mixture models, sometimes called infinite mixture models, are a prominent example where DPs are employed to fit a clustering model to data where the number of clusters is part of the model, rather than specified a priori as a hyperparameter. In the course of discussing where they could be applied in computational biology, a member of the club sent around a recent paper in Nature where the authors use a DP model for different mutation rates in breast cancer tumour subtypes.

It’s a great paper, but the DP model does not appear in it. So where is it? Well, the graphical model representation appears on page 18 of 145 of the supplemental materials, and a short description of how they fit the DP to their data to estimate the probability of observing cells with a given mutation appears at the end of page 137. It is a small grumble, but consider this: this is just one part of the analysis of a huge amount of RNA-seq data (along with other types of complementary sequencing data) in a really complicated and intricate research effort, which forms a paper within the paper. I have to wonder if any similarly capable group of researchers could replicate these results if they had just this paper to guide them. After all, what is the point of publishing a method that others cannot reproduce?

I want to make it clear that I think this paper represents good science, but the way in which they share that science is not great. The code for the DP (no language mentioned) is available via the authors. It would surely be more accessible if submitted to Bioconductor (if it’s in R), or as part of an IPython notebook (if it’s in Python, like Titus Brown’s diginorm work), or even on GitHub. The authors could set it up there and add a script that pulls a small sample from the NCBI Short Read Archive for testing or demonstration purposes. This would be more work, but probably not that much more than preparing a Nature-calibre manuscript, and it would make the code *much* more accessible and reusable by the scientific community.