
The Bit-Bucket Challenge

The ALS Ice Bucket Challenge is sweeping the internet, and people are busy coming up with crafty and audacious ways to dump icy water on their heads. I’d like to propose a lesser challenge, one that would help people who work with data everywhere. I call it the bit-bucket challenge. In the words of Godard:

it’s not important where you steal from, it’s where you bring it to

So, here’s my proposal:

ML researchers of the world: please, please, please, release your code under a permissive (BSD, MIT, maybe LGPL) license.

No need to make a video, no need to agonize over the fine line between slackto-narcissism and do-goodery. None of that. This is *much* easier. Here’s how to do it:

  1. Sign up on GitHub, or Bitbucket, or SourceForge, or make your own webpage. Just find somewhere to host your project.
  2. Take the readme.txt you stuck in the main working directory of your project, and annotate that in markdown. It should tell people how to build your code, and make note of any obvious stumbling points in the code. Take this time to write one, if you haven’t already done so.
  3. Upload all of that onto the internet (see step 1).
  4. Make a blog post or a tweet briefly describing what you’ve just posted.
  5. Tag three colleagues in your post, challenging them to release a project that they’ve published as a paper.

That’s it.

Is your code ugly? Full of hacks? Hard-coded values? Bubble sort? Do not let this deter you; if your code is the least bit functional, someone who has read your paper and would like to try your algorithm will take that mess and improve it. They will patch it, refactor it or make other radical changes to suit their purpose. Chances are good that they’ll share these improvements with you, gratis! Heck, they might even write new tests for your code (you have some tests, right?). But they can’t do that unless you give them the chance.

Why is this a worthwhile challenge? Well, re-implementing a scientific paper is hard. Also, software is now firmly entrenched in the scientific process. If you’re doing computational method development and just writing about your results, I don’t think you’re doing it properly.

“if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice)

(h/t to Gael Varoquaux)

Worried that no one will care about your code? You’re in good company. There are plenty of projects out there not used by anyone other than the author, or algorithms of marginal value. Believe me, I’ve written plenty. But on the off-chance that someone *does* find your code useful, you’ve just saved this person many, many hours of work in trying to re-implement your algorithm from scratch.

So please, dear friends, let’s all share some code. Just do it. Do it for the betterment of us all.


Looking for an industry job? Take note

I’m currently on leave to do an internship at a startup company. When people asked me why I decided to pursue an internship, I replied (in jest) that after so many years of grad school, I should try to show that I’m still employable. Today I read an article that suggests this is more true than I realized.

The article by Chand John, a PhD grad in Computer Science from Stanford, underscores the importance of industry experience or exposure when targeting an industry job after graduation. John’s job search (thankfully successful) took one whole year. He went to informational interviews, he vetted his resume with friends in industry, he studiously prepared for each one-on-one interview. Getting interviews? Not a problem: he landed more than 30 of them before finding a job that interested him and a company willing to take a chance on a PhD grad with no industry experience:

No one could pinpoint anything I was doing wrong. Professors and industry veterans inferred I must be saying something really crazy to destroy myself in 30-plus interviews: There was “no way” a person with my credentials could be denied so many jobs. However, I had said nothing crazy. My interviews had largely gone smoothly. And I did eventually land a job closely related to my Ph.D. But the opportunity didn’t arise until a year after finishing my doctorate. Before that lucky break, my accomplishments and efforts weren’t paying off.


As a scientist, I had already been gathering data about that question. Each time I was rejected from a job, I asked the companies for reasons. They were often vague, but two patterns emerged: (1) Companies hesitated to hire a Ph.D. with no industry experience (no big surprise) even if they had selected you for an interview and you did well (surprise!). And (2) my Ph.D. background, while impressive, just didn’t fit the profile of a data scientist (whose background is usually in machine learning or statistics), a product manager (Ph.D.’s couldn’t even apply for Google’s Associate Product Manager Program until recently), or a programmer (my experience writing code at a university, even on a product with 47,000 unique downloads, didn’t count as coding “experience”).

On the first reading, this article struck me as quite sombre: if this Stanford PhD grad took a year to find a job, what hope do the rest of us have? But after reading more carefully, I noticed some important gaps that put him at a comparative disadvantage: the lack of industry experience, and the mismatch between his skills and the skills that employers were looking for (viz. machine learning experience for data science jobs). So what does this mean for PhD students looking towards industry after graduation? Don’t just assume your status as a PhD grad will make you an attractive candidate. PhD students don’t have a monopoly on learning quickly. When competing for industry jobs, assume you’re only as attractive as your skills, your experience, and your portfolio.

If we want to transition into industry after graduation, then we need to make ourselves into attractive candidates for those jobs. That could include internship experience to develop your portfolio. That could mean contributing to OSS projects that have credibility in industry. That could mean taking classes that may not directly relate to your current topic, but will help you develop skills that are in demand.

John closes with a salient point: public dollars fund much of PhD research. The government invests in students to develop their skills, and in exchange these grads repay this investment many-fold over their careers, enriching society with the output of their work. When PhD grads struggle to contribute, everyone loses.

The nature of things

Yesterday in our statistics journal club we began reviewing a paper on Dirichlet processes. The Dirichlet process (DP) is a (somewhat complicated) stochastic process used in Bayesian nonparametric models of data. Dirichlet process mixture models, sometimes called infinite mixture models, are a prominent example where DPs are employed to fit a clustering model to data where the number of clusters is inferred as part of the model, rather than specified a priori as a hyperparameter. In the course of discussing where they could be applied in computational biology, a member of the club sent around a recent paper in Nature where the authors use a DP model for different mutation rates in breast cancer tumour subtypes.
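
To make the "number of clusters is part of the model" idea concrete, here is a minimal sketch of the Chinese restaurant process, a standard way to describe the cluster assignments that a DP induces. This is purely illustrative and is not the model from the Nature paper; the function name `sample_crp` and the parameter names are my own, and `alpha` stands for the DP concentration parameter.

```python
import random


def sample_crp(n_points, alpha, seed=None):
    """Draw cluster labels for n_points from a Chinese restaurant process.

    alpha is the concentration parameter (must be > 0). Names here are
    illustrative assumptions, not taken from any particular library.
    """
    rng = random.Random(seed)
    assignments = []  # cluster label for each point, in arrival order
    counts = []       # counts[k] = number of points currently in cluster k

    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)

    return assignments, counts


if __name__ == "__main__":
    labels, sizes = sample_crp(n_points=100, alpha=1.0, seed=42)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Nothing in the code fixes the number of clusters in advance: it falls out of the sampling, and larger values of `alpha` tend to produce more clusters. A full DP mixture model would also attach a data likelihood to each cluster, which this sketch leaves out.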

It’s a great paper, but the DP model does not appear in it. So where is it? Well, the graphical model representation appears on page 18 of 145 of the supplemental materials, and a short description of how they fit the DP to their data to estimate the probability of observing cells with a given mutation appears at the end of page 137. It is a small grumble, but consider this: this is just one part of the analysis of a huge amount of RNA-seq data (along with other types of complementary sequencing data) in a really complicated and intricate research effort, which forms a paper within the paper. I have to wonder if any similarly capable group of researchers could replicate these results if they had just this paper to guide them. After all, what is the point of

I want to make it clear that I think this paper represents good science, but the way in which they share that science is not great. The code for the DP (no language mentioned) is available via the authors. It would surely be more accessible if submitted to Bioconductor (if it’s in R), or as part of an IPython notebook (if it’s in Python, like Titus Brown’s diginorm work), or even on GitHub. The authors could set it up there, and add a script that pulls a small sample from the NCBI Short Read Archive for testing or demonstration purposes. This would be more work, but probably not that much more than preparing a Nature-calibre manuscript, and it would make the code within *much* more accessible and reusable by the scientific community.