Forgotten drafts: Headed to NIPS2013

[ed: I wrote this while stuck in the Denver airport after a missed connection to Reno (final destination: NIPS), and summarily forgot about it. I’m beginning to digest some NIPS papers, so it feels right to post this now.]

I write this while stuck in the Denver airport after a missed connection. I’m very excited to attend NIPS for the first time, though a little nervous. With so much material presented each day, for five days straight, I’m worried that I’m going to miss something important. Already I’m missing the first few hours of the first poster session through this delay :/

In the meantime, here are the posters I’m going to focus on for the Thursday night session. The great thing about NIPS is that all papers are up, and easily imported into Mendeley.

A quick thought on writing about failure

I recently finished reading a great resource for CS PhD students that I’d like to advertise: Philip Guo’s PhD Grind.

It’s a memoir that recounts some of the author’s experiences accrued during his PhD. There are plenty of passages which speak to PhD students (Guo posts a fraction of the torrent of feedback he gets on his site). I certainly empathized with some of them. I think he’s shared some valuable lessons within about academic problems that are both common and hard to work through: finding research topics, finding inspiration, how to strike up or nurture collaborations. If you’re spinning your wheels in the lab, or you doubt that you’re smart enough for grad school, or you want to take ownership of your PhD but aren’t sure how to begin, read this book. Each chapter is a gem.

These are lessons rather than recipes. Your circumstances will be different. But he’s been through struggles shared by many CS grad students, and come out wiser. Reading about them should bring you some comfort, and maybe, if you’re lucky, some inspiration.

Oh, I had a point I wanted to make about dealing with professional (i.e academic) rejection. In The PhD Grind, Guo experiences plenty of paper rejections, frustrations, and ends up questioning whether or not he’s got the chops to continue in academia. He doubts that he’s good enough, that he’s smart enough, that he’s accomplished enough to earn a professorship. As the book ends he decides to ‘retire’ from academic life, for a number of reasons; a decision with which he’s perfectly at peace. Today he’s a professor at University of Rochester. What changed?

It made me recall a similar personal story of the philosopher Mark Kingwell I encountered in his book called the Pursuit of Happiness. He wrote it many years ago while working as a sessional lecturer at University of Toronto. I can still recall him recount with precision how he felt he did not belong in the philosophy department at U of T. He finished the book with, it seemed, every intention of never working there again. Today he’s a tenured professor at University of Toronto. What changed?

Did each arrive at some kind of Michael Corleone moment? Probably not. But I do think there’s a common lesson illustrated here. Maybe these two examples show how writing about failure helps to channel personal feelings of self-doubt into something more constructive. Maybe writing candidly about failure [1] has therapeutic benefits. Maybe it helps release you from the pressure of not achieving your goal, and allows you to refocus on what you should do next.

[1] This probably only works when you publish the work in some form.

NIPS Top 6 papers roundup

I’ve been stalling on this post for a while. I had a fantastic time at Nips this year. I was completely unprepared to go to a conference where everyone I encountered was a machine learning researcher. Never once did I have to use high level terms to describe my research; I could just let the jargon fly free, and be understood!

I’ve chosen a small selection of papers that spoke to me. There were tonnes of good papers, and plenty of blog posts reviewing them (a good start is to look for the #nips2013 twitter stream). This selection is what speaks to my present interests: multi-task learning papers, stochastic gradient variance reduction, clustering, and of course anything applied to biological imaging.

Jason Chang, John W. Fisher III: Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Summary: They develop an MCMC sampler that can be parallelized, leading to good performance gains. They maintain two sub-clusters for each cluster by way of auxiliary variables, and use sub clusters to propose a good split. They claim to be able to enforce the convergence of the stationary distribution of the Markov chain (presumably to the posterior of the cluster assignments) without approximations. Code is available, which is a big plus.

Michael C. Hughes, Erik B. Sudderth: Memoized online variational inference for Dirichlet Process mixture models

Summary: They develop a new variational inference algorithm that stores and maintains finite-dimensional sufficient statistics from batches of a (very) large dataset. Seems like a split-merge model. They also have code available @

Dahua Lin: Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Summary: Another variational approximation inference algorithm for DP mixture models, that also scales to very large data sets. Didn’t catch much of it at the poster session, but looks promising.

Marius Pachitariu, Adam Packer, Noah Pettit, Henry Dagleish, Michael Hausser, Maneesh Sahani: Extracting regions of interest from biological images with convolutional sparse block coding

Summary: A great paper about formulating a generative model of biological images, based on convolutional sparse block coding. This is basically a grouped deformable prototype model, where prototypes are grouped into blocks. A cell image in this model is represented as an interpolation with different coefficients over the elements of the block, for several blocks. The code is not yet available, but it seems like a useful way to do segmentation of cell objects when they display more variation than yeast.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, Ambuj Tewari: Learning with Noisy Labels

Summary: They proposed a model of how to learn a binary classifier in the presence of random classification noise in the labels. The labels have been independently flipped with some small probability. It’s a new interpretation of biased loss functions; two methods, both of which reduce to the functional form of biased class composition SVM loss.

Xinhua Zhang, Wee Sun Lee, Yee Whye Teh: Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Summary: A bit outside my field of interest, but this paper really impressed me. They derive a method to learn kernels that learn their parameters to respect formalized invariance properties. The main contribution here is a representer theorem for their augmented loss function. Take a kernel family, as well as some function that encodes a given invariance property (e.g affine transformation), and they have an optimization program that will find the appropriate kernel parameters.

This is why we need code review.

One of the most fundamental operations in parallel computing is an operation called scan. Scan comes in two varieties, inclusive scan and exclusive scan (depending on whether y_i includes x_i or not.

From wikipedia: In computer science, the prefix sum, scan, or cumulative sum of a sequence of numbers x_0, x_1, x_2,\; \dots is a second sequence of numbers y_0, y_1, y_2,\; \dots , the sums of prefixes (running totals) of the input sequence y_i = y_{i-1} + x_i (note this definition is inclusive).

Simple enough right? I’ve seen two simple variants for computing scan, one by Hillis and Steele, the other by Blelloch. Though it seems like scan is an inherently serial computation due to the dependency on the previous input, it actually decouples very nicely and can be computed in parallel very easily. Hillis and Steele’s exclusive scan algorithm goes something like this:

len = input.size;
output[threadId] = input[threadId];
for (int i = 1; i = i)
output[threadId] += input[threadId - i]
[ed: wordpress is not rendering this properly, but you can find the pseudo-code description in the link below]

Hillis and Steele’s scan is what is called step efficient: it executes in O(\log(n)) steps for inputs of size n. But at each step, it performs O(n) operations, and so the overall work complexity is O(n * \log(n)). Blelloch’s is more complex, but is more work efficient: it requires only O(n) operations. Here’s my code for Hillis and Steele’s inclusive scan:

__global__ void hs_prefix_add(int * d_in, unsigned int *d_out)
extern __shared__ unsigned int s_in[];
extern __shared__ unsigned int s_out[];
int tid = threadIdx.x;

// load input into shared memory.
s_in[tid] = static_cast(d_in[tid]);
s_out[tid] = s_in[tid];

for (int offset = 1; offset < blockDim.x; offset <= offset)
s_out[tid] += s_in[tid - offset];

d_out[tid] = s_out[tid];

I’m using shared memory for the buffers since it’s directly on board the thread blocks, and is thus much faster to access than going back to global memory for every step of the loop. Simple, right? When I first tried to implement scan, I thought I’d see if Nvidia would offer any pointers explaining the finer points. After all, my implementation doesn’t handle cases with arrays that are not a power of 2, and won’t scale to arrays larger than the dimension of a block (which happened to be fine for the problem I was solving, but not o.k in general). So I found this document, talking about prefix sum. Here’s their algorithm for Hillis and Steele’s exclusive scan:

__global__ void scan(float *g_odata, float *g_idata, int n)
extern __shared__ float temp[]; // allocated on invocation
int thid = threadIdx.x;
int pout = 0, pin = 1;
// load input into shared memory.
// This is exclusive scan, so shift right by one and set first elt to 0
temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
for (int offset = 1; offset = offset)
temp[pout*n+thid] += temp[pin*n+thid - offset];
temp[pout*n+thid] = temp[pin*n+thid];
g_odata[thid] = temp[pout*n+thid1]; // write output

When I read this code, I have to ask at least the following:

  • I don’t understand why this code uses ‘double buffer indices’ for one large array, rather than two shared memory arrays. It makes for less readable code, and introduces a potential bug if the user places the swap below the if block rather than above.
  • The last line which writes the output back to global memory contains a typo (thid1 should be thid)

This could go into a textbook as an example of why we need code review. That it was published with a compilation error tells me that this code was un-tested. Even after fixing it, try running it at home and see what you get (hint: not what you would expect).

Perhaps the mistakes and obfuscation were intentional, to foil google-copy-and-paste homework submissions for future courses. But I think that greater damage is done by having present and future searches point to obfuscated and incorrect code (the same code, error and all, appears in these slides). Parallel computation is hard enough without having to decode stuff like this. We should be teaching people to write legible, easily verifiable code that can later be optimized or generalized.

A belated MLSB 2013 retrospective

MLSB is a special interest group meeting that is co-located with ISMB, the Intelligent Systems for Molecular Biology conference. MLSB is devoted to Machine Learning research in the area of Systems Biology. It brings together machine learning folks, statisticians, geneticists, and biologists; it was a great place to strike up discussion about how to do better analysis on biological data. The full program is available online, so here I’m going to write up a few highlights for each day.


Markus Heinonen gave a solid talk about time dependent GP regression and significance testing for sparse time series. His group was examining the behaviour of cells under irradiation: how does this treatment affect both diseased and healthy cells with respect to gene regulation, cell processes, and the development of radiation disease. They pose this as trying to determine which genes / proteins are differentially expressed in irradiated cells over a time course. He applied GP regression to time series, using significance tests for sparse time series to determine significant expression (or changes). A nice piece of work, but they didn’t provide a citation since it was probably not yet accepted. But I found a link via google scholar.

Lani Wu gave a good invited talk about two recent stories from their lab involving neutrophil locomotion: one about perturbation analysis, and the other about cellular heterogeneity. There wasn’t anything really new in their methods, but what made an impression was the thoroughness of their analysis. Their lab has a great record of turning out quality work. As a fellow high throughput image biologist, who knows all the pain involved when trying to extract meaning from images (which are usually generated in a really noisy way), I was impressed 🙂

Quaid Morris gave a concise and informative talk, presenting work by Shankar that extended Sara Mostafavi‘s work on 3Prop. It addresses function prediction where the input is a set of heterogeneous networks, as well as a group of genes, and the desired outputs are labels for the set of query genes. Their method is called LMGraph, and it incorporates feature information into the function prediction algorithm. They extract discriminative features from all available networks, and combine them via a simple weighting scheme. This has nice properties: it’s highly scalable, using efficient linear classifiers, and it combines both network with feature data to learn interpretable models.


Johannes Stephan gave a cool talk about Mixed Random Forests. They developed a new model intended for GWAS data, to try and detect functional relationship between phenotype and genotype while accounting for confounding factors due to population structure.

The standard approach to addressing population structure is to use a linear mixed model:
y = x\beta + \mu + \epsilon
\mu \sim N(0,\sigma^{2}_{g}K) where K is the structured covariance matrix, \epsilon \sim N(0, \sigma^{2}_{v}I)
Then do a log ratio test on Beta coefficients.

A more difficult model is one where many SNPs that have interaction effects which are non-linear
f_s(X) = X_1\beta_1 + X_2\beta_2 + \ldots

Their idea is to combine linear mixed models with tree based models to infer f_s(X) in the presence of the population effect. The nice thing about growing trees is that the method is fast, but are unstable with regards to small changes in input (highly discontinuous). This instability can be mitigated by subsampling and building many trees (RFs). The take home message is to use RF for feature selection, they use linear mixed models for prediction.

Finally, John Marioni gave a really interesting technical talk about the challenges in single cell RNA sequencing. The big idea is: multiple cells from a population -> single cell RNA-seq -> correlate patterns across cells -> infer cell type.

The big challenges remaining:

  • quantify variability, identify highly variable genes
  • visualization and clustering
  • downstream interpretation

John described several technical aspects of experiments in single cell RNA-seq. Long story short: technical variability is still a problem, e.g count value ranges quickly increase when technical noise is large. One example nice bit of work was their work to estimate kinetic model parameters for stochastic models of gene expression from single cell RNA expression data. This talk again didn’t contain anything so new and shiny, but the quality and thoroughness of the work presented that captured my attention. This was my first exposure to real data from single cell RNA seq experiments, which alone was enough to impress me.

Looking for an industry job? Take note

I’m currently on leave to do an internship at a startup company. When people asked me why I decided to pursue an internship, I replied (in jest) that after so many years of grad school, I should try to show that I’m still employable. Today I read an article that suggests this is more true than I realized.

The article by Chand John, a PhD grad in Computer Science from Stanford, underscores the importance of industry experience or exposure when targeting an industry job after graduation. John’s job search (thankfully successful) took one whole year. He went to informational interviews, he vetted his resume with friends in industry, he studiously prepared for each one-on-one interview. Getting interviews? Not a problem, he landed more than 30 interviews before finding a job which interested him and a company willing to take a chance on a PhD grad with no industry experience:

No one could pinpoint anything I was doing wrong. Professors and industry veterans inferred I must be saying something really crazy to destroy myself in 30-plus interviews: There was “no way” a person with my credentials could be denied so many jobs. However, I had said nothing crazy. My interviews had largely gone smoothly. And I did eventually land a job closely related to my Ph.D. But the opportunity didn’t arise until a year after finishing my doctorate. Before that lucky break, my accomplishments and efforts weren’t paying off.


As a scientist, I had already been gathering data about that question. Each time I was rejected from a job, I asked the companies for reasons. They were often vague, but two patterns emerged: (1) Companies hesitated to hire a Ph.D. with no industry experience (no big surprise) even if they had selected you for an interview and you did well (surprise!). And (2) my Ph.D. background, while impressive, just didn’t fit the profile of a data scientist (whose background is usually in machine learning or statistics), a product manager (Ph.D.’s couldn’t even apply for Google’s Associate Product Manager Program until recently), or a programmer (my experience writing code at a university, even on a product with 47,000 unique downloads, didn’t count as coding “experience”).

On the first reading, this article struck me as quite sombre: if this Stanford PhD grad took a year to find a job, what hope do the rest of us have? But after reading more carefully, I noticed there were some important steps he did not undertake which put him at comparative disadvantage: the lack of industry experience, the mismatches between his skills and the skills that employers were looking for (viz: machine learning experience for data science jobs). So what does this mean for PhD students looking towards industry after graduation? Don’t just assume your status as a PhD grad will make you an attractive candidate. PhD students don’t have a monopoly on learning quickly. When competing for industry jobs, assume you’re only as attractive as your skills, your experience, and your portfolio.

If we want to transition into industry after graduation, then we need to make ourselves into attractive candidates for those jobs. That could include internship experience to develop your portfolio. That could mean contributing to OSS projects that have credibility in industry. That could mean taking the classes which may not directly relate to your current topic, but will help you develop skills which are in demand.

John closes with a salient point: public dollars funds much of PhD research. The government investments in students to develop their skills, and in exchange these grads will repay this investment many-fold over their careers, enriching society with the output of their work. When PhD grads struggle to contribute, everyone loses.

R by Radford Neal

Radford Neal recently announced the release of pretty quick R (pqR), a fork of R from 2.15.0. It’s got lots of nice performance boosts, the most valuable of which for me is transparent multiprocessing capability.

I won’t repeat any more of the the post announcing it, check it out for yourself. I hate to be a back-seat developer, but why not use the name fastR?