Category Archives: Machine Learning

NIPS2014 de-brief: the Friday workshops

Caveat lector: I began composing this back in January, then set it aside to work on thesis stuff. It has been overlooked ever since. NIPS is getting pretty far in the rear-view, but the workshops this year were excellent, and deserve some mention. So, though it’s under-prepared, and looking back I see I didn’t take nearly as many notes as I should have, I’m going to force this post out now. Please don’t interpret the sparsity of my notes as lack of interest.

The last two days of NIPS2014 were all about the workshops. On Friday, I attended the inaugural offering of Machine Learning for Clinical Data Analysis, as well as the workshop on Deep Learning and Representation Learning.

Machine Learning for Clinical Data Analysis

I wish I had taken better notes from the morning session of ml4chg. The invited talk by Gilles Clermont was structured around trying to find the right questions to ask with respect to using machine learning in clinical settings. The ICU is expensive, and technologically intensive. Mortality is high, and risk is high. Doctors want to be convinced beyond just figures showing AUPR, AUROC are good. They want to be convinced that we can use fancy computer models to reduce errors, and manage data. I left thinking that there remain a lot of questions to be answered (and relationships between clinicians and researchers to be built) before any rubber hits the road. The Q&A session that finished off the morning underscored this last point.

Deep Learning

I arrived at the first (!) poster session. Here were a couple that caught my eye.

Invited talk by Rich Schwartz (BBN): Fast and Robust Shallow Neural Networks Joint Models for Machine Translation

  • shallowest ngram translation model. Simpler was better, but no phrases, no previous translations that were incorporated.
  • Yoshua had loads of ideas to improve what was presented.

Invited talk by Vlad Mnih (Google DeepMind): Deep Reinforcement Learning and REINFORCEd Deep Learning:

  • RL and deep learning should be better friends.
  • big problems: noisy, delayed signals. Nns don’t necessarily work well in non stationary environments, which is the case in RL
  • Q learning updates: smoothed action reward scheme. Froze params of network that generated the target actions, periodically refreshed this.
  • another nice thing about RL: add non-differentiable components for NNs using reinforce.
  • attention model reduces the amount of computation for convnets, fixes the amount of conv

Invited talk by Surya Ganguli (Stanford): From statistical physics to deep learning: on the beneficial role of dynamic criticality, random landscapes, and the reversal of time.

  • good stuff from statistical physics that can help deep learning (published work on his website)
  • theory of deep learning (initializing weight dynamics theory), says number of epochs does not need to increase with depth, if chosen properly.
  • questions from: Geoff Hinton, Then Yann Lecun, then Yoshua Bengio. Got a big laugh from the audience, but was in itself instructive; the three wise men all believe that physics has something to teach us about the mechanics of error surfaces in deep learning.

That’s it. There were a few photos of posters that I wanted to recall the details later that I won’t post, since while I asked for permission to take the photos, I didn’t think to ask permission to share them. The Saturday workshops were all about MLCB for me, to which I’ll devote a separate post.


NIPS2014 de-brief: the main conference

NIPS 2014 has come and gone. My brain is full of ideas, I’ve met a host of new people, and my body feels shattered by the 7:30 to 12:00 schedule. This was my second time attending NIPS, and this meeting has strengthened my opinion that the most valuable part of the conference are the posters. The poster sessions are the first-class content. They’re interactive, all the papers are online, and there’s no scheduling problems. Just go around, find work that suits you, and meet the people responsible. For such a large single track conference, they’re indispensable. While you can read the papers online from anywhere, it is nowhere near so rewarding as speaking to the authors directly. If you ask the right question, and the author isn’t too wiped out by the previous four hours of Q&A, you might get insights that don’t make it into the paper, which can be the beginning of a great new idea. It’s fast mixing of chains.

Perhaps I will begin to feel differently about these exhausting sessions as I attend more meetings, and will become less inclined to spend hours among the halls. But for now, the opportunity to connect with such smart, committed people and to tell them personally how much I appreciate their hard work is my favourite part of NIPS. I hope that never changes.

I’ll avoid commenting on any of the talks since, invited speakers apart, each talk is derived from a poster. There are over 100 on display each session, and four sessions in total. I’ll report my favourite poster from each session. You can find all the posters with links to the papers here.

Monday’s poster: Improved Multimodal Deep Learning with Variation of Information [Mon59]

Deep learning has been successfully applied to multimodal representation learning problems, with a common strategy to learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains a question how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about missing data modality given the rest of data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide a theoretical insight why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend to deep networks with recurrent encoding structure to finetune the whole network. In experiments, we demonstrate the state-of-the-art visual recognition performance on MIR-Flickr database and PASCAL VOC 2007 database with and without text features.

Why it’s cool: Multi-modal data is commonplace in computational biology, so I’m all for models that incorporate it. What made this paper stand out for me was that they did not try to learn the model parameters by maximizing the likelihood of the data.

Tuesday’s poster: Nonparametric Bayesian inference on multivariate exponential families [Tue25]

We develop a model by choosing the maximum entropy distribution from the set of models satisfying certain smoothness and independence criteria; we show that inference on this model generalizes local kernel estimation to the context of Bayesian inference on stochastic processes. Our model enables Bayesian inference in contexts when standard techniques like Gaussian process inference are too expensive to apply. Exact inference on our model is possible for any likelihood function from the exponential family. Inference is then highly efficient, requiring only O(log N) time and O(N) space at run time. We demonstrate our algorithm on several problems and show quantifiable improvement in both speed and performance relative to models based on the Gaussian process.

Why it’s cool: They promise to deliver a really big reward: exact inference for *any* likelihood function in the exponential family. Kernel density estimation for non-parametric Bayesian models? Far out.

Wednesday’s poster: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization [Web39]

A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.

Why it’s cool: Great work by Yann Dauphin and Razvan Pascanu, in conjunction with Surya Ganguli. They used arguments from statistical physics, and some neat experiments to demonstrate that what people originally thought to be bad local minima in the search for parameters in deep networks are actually saddle points which are hard to escape. They propose a modified quasi-Newton method which works well on large auto encoder methods . Even better, their code will soon be released in Theano. Maybe my favourite poster.

Thursday’s poster: Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models [Thu63]

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

Why it’s cool: Distrubuted computation of GP and GPLVMs. Their key insight is that, conditioned on inducing points, the observations decouple, so p(F_i | X, u) and p(Y_i | F_i) can be evaluated in parallel for every i.

That’s all for now. If you enjoy Montreal, and are thinking of attending NIPS next year, then rejoice; it will be located in Montreal again. The Palais des Congrès was a bright, spacious, easily accessible venue. I can’t wait to return.

Bayesian non-parametric models for number of component estimation: bad?

I’ve been reading several different resources on Bayesian non-parametric models (abbrv. NPB), mostly for clustering.  One of the well-communicated benefits of Bayesian non-parametric models for clustering is that you do not have to specify the parametric capacity (for lack of a better term) of the model; inference of the model subsumes this.  So I was really surprised to read Peter Orbanz critique of choosing Bayesian non-parametric models by reason of elegant model selection (section 2.9, pp 19-20). According to him, this is wrong:
An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics:
Bayesian nonparametric mixtures are not a tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.
Why the surprise?  Well, in this 2011 NIPS tutorial, given by Yee Whye Teh & Peter Orbanz, one of the benefits of NPB models is their elegant approach to model selection!  Another introductory review of these models, by Samuel Gershman & David Blei, opens by extolling the virtuous qualities of NPB models for model selection in clustering. Why this repudiation NPB for model selection?
Orbanz develops his argument as follows:
I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number K_n ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of K_n are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely.  Hence, the modeling assumption implicit in a DP mixture is that
as n → ∞, we will inevitably (with probability 1) observe an infinite number of clusters.
To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:
(1) K is finite and known, which is the assumption expressed by a finite mixture model of order K.
(2) K is finite but unknown. In this case, the appropriate model would be a
finite mixture model of unknown order. This problem should not be modeled
by a DP or other infinite mixture.
(3) K is infinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor
In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.
What if we really have reason to assume that (2) is adequate? In a classical
mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc.
Orbanz goes on to cite a paper by Jeffery Miller and Matthew Harrison, who show that under certain (but common) conditions,  the posterior distributions of Dirichlet Process mixture and Pitman-Yor mixture models are not consistent; the posterior will not concentrate on the true number of mixture components with probability one.  I won’t reproduce their results here, the MathJax support for Evernote is not great on the web interface.  They do, however, offer this remark:
We would like to emphasize that this inconsistency should not be viewed as a deficiency of Dirichlet process mixtures, but is simply due to a misapplication of them. As flexible priors on densities, DPMs are superb, and there are strong results showing that in many cases the posterior on the density converges in L1 to the true density at the minimax-optimal rate, up to a logarithmic factor (see Scricciolo (2012), Ghosal (2010) and references therein). Further, Nguyen (2013) has recently shown that the posterior on the mixing distribution converges in the Wasserstein metric to the true mixing distribution. However, these results do not necessarily imply consistency for the number of components, since any mixture can be approximated arbitrarily well in these metrics by another mixture with a larger number of components (for instance, by making the weights of the extra components infinitesimally small).
This seems to be the problem: non-empty but very small superfluous clusters.  This has been studied both in theory and borne out in experiments before.  The authors continue:
This overestimation seems to occur because there are typically a few tiny “extra” clusters.  Among researchers using DPMs for clustering, this is an annoyance that is often dealt with by pruning such clusters — that is, by simply ignoring them when calculating statistics such as the number of clusters. It may be possible to obtain consistent estimators in this way, but this remains an open question
Q: What does this all mean for the rest of us who want to cluster their data without running an outer loop for model selection?
A: I’ll read the Rousseau and Mengersen paper to see what they report, but I think that it’s still ok to use NPB to get a good idea of how many components are in a finite mixture, even if Orbanz’ condition (3) does not hold.  What’s most important for researchers looking to understand their data set is not that an NPB posterior be concentrated almost surely on the true number of components, but that it be *mostly* right about the number of components.  As long as the number of superfluous components is small, and their associated clusters also small, it’s not a big deal.  Using NPB has the extra virtue of forcing users to  think harder about aspects of the model (specifying base measures, prior assumptions), which simpler clustering models like k-means or mixture of Gaussians do not.

The Bit-Bucket Challenge

The ALS Ice Bucket Challenge is sweeping the internet, and people are busy coming up with crafty and audacious ways to dump icy water on their heads. I’d like to propose a lesser challenge, one that would help people who work with data everywhere. I call it the bit-bucket challenge. In the words of Godard:

it’s not important where you steal from, it’s where you bring it to

So, here’s my proposal:

ML researchers of the world: please, please, please, release your code under a permissive (BSD,MIT, maybe L-GPL) license.

No need to make a video, no need to agonize over the fine line between slackto-narcisissism, and do-goodery. None of that. This is *much* easier. Here’s how to do it:

  1. Sign up on GitHub, or bit-bucket, or source-forge, or make your own webpage. Just find somewhere to host your project.
  2. Take the readme.txt you stuck in the main working directory of your project, and annotate that in markdown. It should tell people how to build your code, and make note of any obvious stumbling points in the code. Take this time to write one, if you haven’t already done so.
  3. Upload all of that onto the internet (see step 1).
  4. Make a blog post or a tweet briefly describing what you’ve just posted.
  5. Tag three colleagues in your post, challenging them to release a project that they’ve published as a paper.

That’s it.

Is your code ugly? Full of hacks? Hard coded values? Bubble sort? Do not let this deter you; if your code is the least bit functional, someone who has read your paper and would like to try your algorithm will take that mess and improve it. They will patch it, refactor it or make other radical changes to suit their purpose. Chances are good that they’ll share these improvements with you, gratis! Heck, they might even write new tests for your code (you have some tests, right?). But they can’t do that unless you give them the chance.

Why is this a worthwhile challenge? Well, re-implementing a scientific paper is hard. Also, software is now firmly entrenched in the scientific process. If you’re doing computational method development and just writing about your results, I don’t think you’re doing it properly.

“if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice)

(h/t to Gael Varoquaux)

Worried that no one will care about your code? You’re in good company. There are plenty of projects out there not used by anyone other than the author, or algorithms of marginal value. Believe me, I’ve written plenty. But on the off-chance that someone *does* find your code useful, you’ve just saved this person many, many hours of work in trying to re-implement your algorithm from scratch.

So please, dear friends, let’s all share some code. Just do it. Do it for the betterment of us all.

Forgotten drafts: Headed to NIPS2013

[ed: I wrote this while stuck in the Denver airport after a missed connection to Reno (final destination: NIPS), and summarily forgot about it. I’m beginning to digest some NIPS papers, so it feels right to post this now.]

I write this while stuck in the Denver airport after a missed connection. I’m very excited to attend NIPS for the first time, though a little nervous. With so much material presented each day, for five days straight, I’m worried that I’m going to miss something important. Already I’m missing the first few hours of the first poster session through this delay :/

In the meantime, here are the posters I’m going to focus on for the Thursday night session. The great thing about NIPS is that all papers are up, and easily imported into Mendeley.

NIPS Top 6 papers roundup

I’ve been stalling on this post for a while. I had a fantastic time at Nips this year. I was completely unprepared to go to a conference where everyone I encountered was a machine learning researcher. Never once did I have to use high level terms to describe my research; I could just let the jargon fly free, and be understood!

I’ve chosen a small selection of papers that spoke to me. There were tonnes of good papers, and plenty of blog posts reviewing them (a good start is to look for the #nips2013 twitter stream). This selection is what speaks to my present interests: multi-task learning papers, stochastic gradient variance reduction, clustering, and of course anything applied to biological imaging.

Jason Chang, John W. Fisher III: Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Summary: They develop an MCMC sampler that can be parallelized, leading to good performance gains. They maintain two sub-clusters for each cluster by way of auxiliary variables, and use sub clusters to propose a good split. They claim to be able to enforce the convergence of the stationary distribution of the Markov chain (presumably to the posterior of the cluster assignments) without approximations. Code is available, which is a big plus.

Michael C. Hughes, Erik B. Sudderth: Memoized online variational inference for Dirichlet Process mixture models

Summary: They develop a new variational inference algorithm that stores and maintains finite-dimensional sufficient statistics from batches of a (very) large dataset. Seems like a split-merge model. They also have code available @

Dahua Lin: Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Summary: Another variational approximation inference algorithm for DP mixture models, that also scales to very large data sets. Didn’t catch much of it at the poster session, but looks promising.

Marius Pachitariu, Adam Packer, Noah Pettit, Henry Dagleish, Michael Hausser, Maneesh Sahani: Extracting regions of interest from biological images with convolutional sparse block coding

Summary: A great paper about formulating a generative model of biological images, based on convolutional sparse block coding. This is basically a grouped deformable prototype model, where prototypes are grouped into blocks. A cell image in this model is represented as an interpolation with different coefficients over the elements of the block, for several blocks. The code is not yet available, but it seems like a useful way to do segmentation of cell objects when they display more variation than yeast.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, Ambuj Tewari: Learning with Noisy Labels

Summary: They proposed a model of how to learn a binary classifier in the presence of random classification noise in the labels. The labels have been independently flipped with some small probability. It’s a new interpretation of biased loss functions; two methods, both of which reduce to the functional form of biased class composition SVM loss.

Xinhua Zhang, Wee Sun Lee, Yee Whye Teh: Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Summary: A bit outside my field of interest, but this paper really impressed me. They derive a method to learn kernels that learn their parameters to respect formalized invariance properties. The main contribution here is a representer theorem for their augmented loss function. Take a kernel family, as well as some function that encodes a given invariance property (e.g affine transformation), and they have an optimization program that will find the appropriate kernel parameters.

A belated MLSB 2013 retrospective

MLSB is a special interest group meeting that is co-located with ISMB, the Intelligent Systems for Molecular Biology conference. MLSB is devoted to Machine Learning research in the area of Systems Biology. It brings together machine learning folks, statisticians, geneticists, and biologists; it was a great place to strike up discussion about how to do better analysis on biological data. The full program is available online, so here I’m going to write up a few highlights for each day.


Markus Heinonen gave a solid talk about time dependent GP regression and significance testing for sparse time series. His group was examining the behaviour of cells under irradiation: how does this treatment affect both diseased and healthy cells with respect to gene regulation, cell processes, and the development of radiation disease. They pose this as trying to determine which genes / proteins are differentially expressed in irradiated cells over a time course. He applied GP regression to time series, using significance tests for sparse time series to determine significant expression (or changes). A nice piece of work, but they didn’t provide a citation since it was probably not yet accepted. But I found a link via google scholar.

Lani Wu gave a good invited talk about two recent stories from their lab involving neutrophil locomotion: one about perturbation analysis, and the other about cellular heterogeneity. There wasn’t anything really new in their methods, but what made an impression was the thoroughness of their analysis. Their lab has a great record of turning out quality work. As a fellow high throughput image biologist, who knows all the pain involved when trying to extract meaning from images (which are usually generated in a really noisy way), I was impressed 🙂

Quaid Morris gave a concise and informative talk, presenting work by Shankar that extended Sara Mostafavi‘s work on 3Prop. It addresses function prediction where the input is a set of heterogeneous networks, as well as a group of genes, and the desired outputs are labels for the set of query genes. Their method is called LMGraph, and it incorporates feature information into the function prediction algorithm. They extract discriminative features from all available networks, and combine them via a simple weighting scheme. This has nice properties: it’s highly scalable, using efficient linear classifiers, and it combines both network with feature data to learn interpretable models.


Johannes Stephan gave a cool talk about Mixed Random Forests. They developed a new model intended for GWAS data, to try and detect functional relationship between phenotype and genotype while accounting for confounding factors due to population structure.

The standard approach to addressing population structure is to use a linear mixed model:
y = x\beta + \mu + \epsilon
\mu \sim N(0,\sigma^{2}_{g}K) where K is the structured covariance matrix, \epsilon \sim N(0, \sigma^{2}_{v}I)
Then do a log ratio test on Beta coefficients.

A more difficult model is one where many SNPs that have interaction effects which are non-linear
f_s(X) = X_1\beta_1 + X_2\beta_2 + \ldots

Their idea is to combine linear mixed models with tree based models to infer f_s(X) in the presence of the population effect. The nice thing about growing trees is that the method is fast, but are unstable with regards to small changes in input (highly discontinuous). This instability can be mitigated by subsampling and building many trees (RFs). The take home message is to use RF for feature selection, they use linear mixed models for prediction.

Finally, John Marioni gave a really interesting technical talk about the challenges in single cell RNA sequencing. The big idea is: multiple cells from a population -> single cell RNA-seq -> correlate patterns across cells -> infer cell type.

The big challenges remaining:

  • quantify variability, identify highly variable genes
  • visualization and clustering
  • downstream interpretation

John described several technical aspects of experiments in single cell RNA-seq. Long story short: technical variability is still a problem, e.g count value ranges quickly increase when technical noise is large. One example nice bit of work was their work to estimate kinetic model parameters for stochastic models of gene expression from single cell RNA expression data. This talk again didn’t contain anything so new and shiny, but the quality and thoroughness of the work presented that captured my attention. This was my first exposure to real data from single cell RNA seq experiments, which alone was enough to impress me.