Yesterday in our statistics journal club we began reviewing a paper on Dirichlet processes. The Dirichlet process (DP) is a (somewhat complicated) stochastic process used in Bayesian nonparametric models of data. Dirichlet process mixture models, sometimes called infinite mixture models, are a prominent example: the DP is used to fit a clustering model in which the number of clusters is inferred as part of the model, rather than specified a priori as a hyperparameter. While we were discussing where DPs could be applied in computational biology, a member of the club sent around a recent paper in Nature in which the authors use a DP to model different mutation rates in breast cancer tumour subtypes.
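To make the "number of clusters is part of the model" idea concrete, here is a minimal sketch of the Chinese restaurant process, a standard way of drawing cluster assignments from a DP prior. It is not the model from the Nature paper, and it only samples from the prior rather than fitting anything to data. The function name `sample_crp` and the concentration parameter name `alpha` are my own choices for illustration.

```python
import random


def sample_crp(n_points, alpha, seed=None):
    """Draw cluster assignments for n_points from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values make it
    more likely that a point starts a new cluster.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index for each point, in arrival order
    cluster_sizes = []  # current number of points in each cluster

    for _ in range(n_points):
        # An existing cluster is chosen with probability proportional to its
        # size; a brand-new cluster with probability proportional to alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        _, sizes = sample_crp(500, alpha, seed=1)
        print(f"alpha={alpha}: {len(sizes)} clusters, sizes={sizes}")
```

Nowhere do we tell the sampler how many clusters to use: the count emerges from the draw, and it tends to grow slowly with the number of points and with `alpha`. A full DP mixture model would pair these assignments with a likelihood for the data in each cluster and infer both together.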
The Nature paper is a great paper, but the DP model does not appear in the main text. So where is it? Well, the graphical model representation appears on page 18 of 145 of the supplemental materials, and a short description of how they fit the DP to their data, to estimate the probability of observing cells with a given mutation, appears at the end of page 137. It is a small grumble, but consider this: the DP model is just one part of the analysis of a huge amount of RNA-seq data (along with other, complementary types of sequencing data) in a really complicated and intricate research effort, and the supplement effectively forms a paper within the paper. I have to wonder whether any similarly capable group of researchers could replicate these results if they had just this paper to guide them. After all, what is the point of
I want to make it clear that I think this paper represents good science, but the way in which the authors share that science is not great. The code for the DP (no language mentioned) is available only on request from the authors. It would surely be more accessible if it were submitted to Bioconductor (if it's in R), shared as an IPython notebook (if it's in Python, like Titus Brown's diginorm work), or even just put on GitHub. The authors could set up a repository there and add a script that pulls a small sample from the NCBI Sequence Read Archive for testing or demonstration purposes. This would be more work, but probably not much more compared with preparing a Nature-calibre manuscript, and it would make the code *much* more accessible to, and reusable by, the scientific community.