NIPS2014 de-brief: the main conference

NIPS 2014 has come and gone. My brain is full of ideas, I’ve met a host of new people, and my body feels shattered by the 7:30 to 12:00 schedule. This was my second time attending NIPS, and this meeting has strengthened my opinion that the posters are the most valuable part of the conference. The poster sessions are the first-class content. They’re interactive, all the papers are online, and there are no scheduling problems. Just go around, find work that suits you, and meet the people responsible. For such a large single-track conference, they’re indispensable. While you can read the papers online from anywhere, it is nowhere near as rewarding as speaking to the authors directly. If you ask the right question, and the author isn’t too wiped out by the previous four hours of Q&A, you might get insights that didn’t make it into the paper, which can be the beginning of a great new idea. It’s fast mixing of chains.

Perhaps I will begin to feel differently about these exhausting sessions as I attend more meetings, and will become less inclined to spend hours among the halls. But for now, the opportunity to connect with such smart, committed people and to tell them personally how much I appreciate their hard work is my favourite part of NIPS. I hope that never changes.

I’ll avoid commenting on any of the talks since, invited speakers apart, each talk is derived from a poster. There are over 100 on display each session, and four sessions in total. I’ll report my favourite poster from each session. You can find all the posters with links to the papers here.

Monday’s poster: Improved Multimodal Deep Learning with Variation of Information [Mon59]

Deep learning has been successfully applied to multimodal representation learning problems, with a common strategy to learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains a question how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about missing data modality given the rest of data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide a theoretical insight why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend to deep networks with recurrent encoding structure to finetune the whole network. In experiments, we demonstrate the state-of-the-art visual recognition performance on MIR-Flickr database and PASCAL VOC 2007 database with and without text features.

Why it’s cool: Multi-modal data is commonplace in computational biology, so I’m all for models that incorporate it. What made this paper stand out for me was that they did not try to learn the model parameters by maximizing the likelihood of the data.
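For anyone who hasn’t met it before, the variation of information between two variables is VI(X, Y) = H(X|Y) + H(Y|X): how much uncertainty is left about each modality once you know the other. Below is a minimal sketch of that quantity for a small discrete joint table, in plain Python. The function name and the toy tables are my own illustrative assumptions; this only shows the quantity itself, not the paper’s training procedure, which works with the model’s conditional distributions.

```python
import math

def variation_of_information(joint):
    """VI(X, Y) = H(X|Y) + H(Y|X) in nats.

    joint[i][j] is P(X=i, Y=j); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    vi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                # -p * [log P(x|y) + log P(y|x)]
                vi -= p * (math.log(p / py[j]) + math.log(p / px[i]))
    return vi

# Toy "modalities" that determine each other exactly: nothing left to explain.
coupled = [[0.5, 0.0], [0.0, 0.5]]
# Toy independent modalities: VI reduces to H(X) + H(Y).
independent = [[0.25, 0.25], [0.25, 0.25]]

vi_coupled = variation_of_information(coupled)
vi_independent = variation_of_information(independent)
```

The smaller the value, the better each modality predicts the other, which is why it is a natural thing to minimize when the goal is to fill in a missing modality.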

Tuesday’s poster: Nonparametric Bayesian inference on multivariate exponential families [Tue25]

We develop a model by choosing the maximum entropy distribution from the set of models satisfying certain smoothness and independence criteria; we show that inference on this model generalizes local kernel estimation to the context of Bayesian inference on stochastic processes. Our model enables Bayesian inference in contexts when standard techniques like Gaussian process inference are too expensive to apply. Exact inference on our model is possible for any likelihood function from the exponential family. Inference is then highly efficient, requiring only O(log N) time and O(N) space at run time. We demonstrate our algorithm on several problems and show quantifiable improvement in both speed and performance relative to models based on the Gaussian process.

Why it’s cool: They promise to deliver a really big reward: exact inference for *any* likelihood function in the exponential family. Kernel density estimation for non-parametric Bayesian models? Far out.

Wednesday’s poster: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization [Wed39]

A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.

Why it’s cool: Great work by Yann Dauphin and Razvan Pascanu, in conjunction with Surya Ganguli. They used arguments from statistical physics and some neat experiments to demonstrate that what people originally thought were bad local minima in the search for parameters in deep networks are actually saddle points that are hard to escape. They propose a modified Newton method, the saddle-free Newton method, which works well on large autoencoders. Even better, their code will soon be released in Theano. Maybe my favourite poster.
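To see why a saddle is a trap for plain Newton and how rescaling by |H| helps, here is a toy 2-D sketch in plain Python. It takes the step -|H|^{-1} g, where |H| has the same eigenvectors as the Hessian but the absolute values of its eigenvalues. Everything here is an assumption for illustration (the toy function f(x, y) = x²/2 + y⁴/4 - y²/2, the helper names, the damping constant, and the exact 2x2 eigendecomposition); it is not the authors’ implementation and says nothing about how they scale the idea to real networks.

```python
import math

def saddle_free_step(grad, hess, damping=1e-3):
    """Return -|H|^{-1} g for a symmetric 2x2 Hessian [[a, b], [b, c]]."""
    (a, b), (_, c) = hess
    mean, r = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    eigvals = (mean + r, mean - r)
    eigvecs = ((math.cos(theta), math.sin(theta)),
               (-math.sin(theta), math.cos(theta)))
    step = [0.0, 0.0]
    for lam, v in zip(eigvals, eigvecs):
        # Project the gradient on each eigenvector, rescale by |eigenvalue|.
        coeff = (v[0] * grad[0] + v[1] * grad[1]) / (abs(lam) + damping)
        step[0] -= coeff * v[0]
        step[1] -= coeff * v[1]
    return step

def newton_step(grad, hess):
    """Plain Newton step -H^{-1} g, for comparison."""
    (a, b), (_, c) = hess
    det = a * c - b * b
    return [-(c * grad[0] - b * grad[1]) / det,
            -(a * grad[1] - b * grad[0]) / det]

# Toy surface: saddle at (0, 0), minima at (0, +1) and (0, -1).
def grad_f(p):
    x, y = p
    return [x, y ** 3 - y]

def hess_f(p):
    _, y = p
    return [[1.0, 0.0], [0.0, 3.0 * y * y - 1.0]]

def run(step_fn, start, iters=25):
    p = list(start)
    for _ in range(iters):
        s = step_fn(grad_f(p), hess_f(p))
        p = [p[0] + s[0], p[1] + s[1]]
    return p

newton_end = run(newton_step, (1.0, 0.1))            # ends at the saddle
saddle_free_end = run(saddle_free_step, (1.0, 0.1))  # ends at a minimum
```

Started from the same point, plain Newton is attracted to the saddle at the origin, because dividing by a negative eigenvalue flips the step toward the critical point. Taking the absolute value keeps the Newton-style rescaling but always moves downhill, so the iterate slides off the saddle toward (0, 1).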

Thursday’s poster: Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models [Thu63]

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

Why it’s cool: Distributed computation for GPs and GPLVMs. Their key insight is that, conditioned on the inducing points, the observations decouple, so p(F_i | X, u) and p(Y_i | F_i) can be evaluated in parallel for every i.
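The practical upshot of that decoupling is that the data only enter through sums over data points, and sums can be computed shard by shard and then added up. Here is a rough sketch of that map-reduce pattern in plain Python, using the data-dependent statistics of the standard sparse GP regression bound (Kuf Kfu, Kuf y and yᵀy) for 1-D inputs and an RBF kernel. The function names, the toy data, the inducing locations and the choice of statistics are my assumptions for illustration, not the paper’s code.

```python
import math
from functools import reduce

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def map_stats(xs, ys, zs):
    """Map step: partial (Kuf Kfu, Kuf y, y^T y) from one shard of data."""
    m = len(zs)
    kuf_kfu = [[0.0] * m for _ in range(m)]
    kuf_y = [0.0] * m
    yy = 0.0
    for x, y in zip(xs, ys):
        k = [rbf(z, x) for z in zs]  # kernel between inducing points and x
        for a in range(m):
            kuf_y[a] += k[a] * y
            for b in range(m):
                kuf_kfu[a][b] += k[a] * k[b]
        yy += y * y
    return kuf_kfu, kuf_y, yy

def reduce_stats(s, t):
    """Reduce step: element-wise sum of two partial results."""
    kuf_kfu = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(s[0], t[0])]
    kuf_y = [p + q for p, q in zip(s[1], t[1])]
    return kuf_kfu, kuf_y, s[2] + t[2]

# Hypothetical toy data and inducing inputs.
xs = [0.1 * i for i in range(12)]
ys = [math.sin(x) for x in xs]
zs = [0.0, 0.5, 1.0]

# Three "workers", four points each; in practice each shard lives on its own node.
shards = [(xs[i:i + 4], ys[i:i + 4]) for i in range(0, len(xs), 4)]
distributed = reduce(reduce_stats, (map_stats(sx, sy, zs) for sx, sy in shards))
single_pass = map_stats(xs, ys, zs)
```

The reduced result matches the single-pass result up to floating-point rounding, and each partial result is only as big as the number of inducing points, so what gets sent between nodes does not grow with the size of the dataset.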

That’s all for now. If you enjoy Montreal, and are thinking of attending NIPS next year, then rejoice; it will be located in Montreal again. The Palais des Congrès was a bright, spacious, easily accessible venue. I can’t wait to return.