NIPS2014 de-brief: the Friday workshops

Caveat lector: I began composing this back in January, then set it aside to work on thesis stuff. It has been overlooked ever since. NIPS is getting pretty far in the rear-view, but the workshops this year were excellent, and deserve some mention. So, though it’s under-prepared, and looking back I see I didn’t take nearly as many notes as I should have, I’m going to force this post out now. Please don’t interpret the sparsity of my notes as lack of interest.

The last two days of NIPS2014 were all about the workshops. On Friday, I attended the inaugural offering of Machine Learning for Clinical Data Analysis, as well as the workshop on Deep Learning and Representation Learning.

Machine Learning for Clinical Data Analysis

I wish I had taken better notes from the morning session of ml4chg. The invited talk by Gilles Clermont was structured around finding the right questions to ask about using machine learning in clinical settings. The ICU is expensive and technologically intensive; mortality is high, and so is risk. Doctors want to be convinced by more than figures showing good AUPR or AUROC: they want evidence that fancy computer models can reduce errors and help manage data. I left thinking that there remain a lot of questions to be answered (and relationships between clinicians and researchers to be built) before any rubber hits the road. The Q&A session that finished off the morning underscored this last point.
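
Those two acronyms are the standard ranking metrics. As a concrete (and entirely illustrative) example of the kind of figures clinicians are shown, here is how they are computed with scikit-learn; the library choice and the toy numbers are mine, not something from the talk:

```python
# Illustrative only: computing AUROC and AUPR with scikit-learn on toy data.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 1, 1]           # e.g. patient outcomes (1 = adverse event)
scores = [0.1, 0.4, 0.35, 0.8]  # model risk scores

auroc = roc_auc_score(y_true, scores)           # area under the ROC curve
aupr = average_precision_score(y_true, scores)  # area under the precision-recall curve
print(auroc, aupr)  # 0.75 0.8333...
```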

Deep Learning

I arrived at the first (!) poster session. Here are some notes on what caught my eye.

Invited talk by Rich Schwartz (BBN): Fast and Robust Neural Network Joint Models for Statistical Machine Translation

  • A shallow n-gram neural joint translation model. Simpler was better, though neither phrases nor previously generated translations were incorporated.
  • Yoshua had loads of ideas to improve what was presented.

Invited talk by Vlad Mnih (Google DeepMind): Deep Reinforcement Learning and REINFORCEd Deep Learning:

  • RL and deep learning should be better friends.
  • big problems: noisy, delayed reward signals. NNs don’t necessarily work well in non-stationary environments, which is the case in RL.
  • Q-learning updates: a smoothed action-reward scheme. They froze the parameters of the network that generated the targets, and refreshed this frozen copy periodically.
  • another nice thing about RL: it lets you add non-differentiable components to NNs, via REINFORCE.
  • an attention model reduces the amount of computation for convnets, fixing the amount of convolution performed.
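
The frozen-target trick in the Q-learning bullet can be sketched in a few lines. This is my own toy, tabular illustration of the idea, not DeepMind's code; the rewards, states, and refresh period are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.5
Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()  # frozen copy used to compute targets

for step in range(200):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    r = 1.0 if (s == 4 and a == 1) else 0.0  # reward only at one state-action
    s_next = rng.integers(n_states)
    # the bootstrap target uses the frozen copy, not the live table
    target = r + gamma * Q_target[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    if step % 50 == 0:
        Q_target = Q.copy()  # periodic refresh of the frozen parameters

print(Q[4])  # the rewarded action should dominate at state 4
```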

Invited talk by Surya Ganguli (Stanford): From statistical physics to deep learning: on the beneficial role of dynamic criticality, random landscapes, and the reversal of time.

  • good stuff from statistical physics that can help deep learning (published work on his website)
  • a theory of deep learning (of learning dynamics under weight initialization) which says the number of training epochs need not increase with depth, if the initialization is chosen properly.
  • Questions came from Geoff Hinton, then Yann LeCun, then Yoshua Bengio, in that order. This got a big laugh from the audience, but was in itself instructive: the three wise men all believe that physics has something to teach us about the mechanics of error surfaces in deep learning.

That’s it. I took a few photos of posters whose details I wanted to recall later, but I won’t post them: while I asked for permission to take the photos, I didn’t think to ask for permission to share them. The Saturday workshops were all about MLCB for me; I’ll devote a separate post to them.

NIPS2014 de-brief: the main conference

NIPS 2014 has come and gone. My brain is full of ideas, I’ve met a host of new people, and my body feels shattered by the 7:30 to 12:00 schedule. This was my second time attending NIPS, and this meeting has strengthened my opinion that the most valuable part of the conference is the posters. The poster sessions are the first-class content. They’re interactive, all the papers are online, and there are no scheduling conflicts. Just go around, find work that suits you, and meet the people responsible. For such a large single-track conference, they’re indispensable. While you can read the papers online from anywhere, it is nowhere near as rewarding as speaking to the authors directly. If you ask the right question, and the author isn’t too wiped out by the previous four hours of Q&A, you might get insights that don’t make it into the paper, which can be the beginning of a great new idea. It’s fast mixing of chains.

Perhaps I will begin to feel differently about these exhausting sessions as I attend more meetings, and will become less inclined to spend hours among the halls. But for now, the opportunity to connect with such smart, committed people and to tell them personally how much I appreciate their hard work is my favourite part of NIPS. I hope that never changes.

I’ll avoid commenting on any of the talks since, invited speakers apart, each talk is derived from a poster. There are over 100 posters on display in each session, and four sessions in total. I’ll report my favourite poster from each session. You can find all the posters, with links to the papers, here.

Monday’s poster: Improved Multimodal Deep Learning with Variation of Information [Mon59]

Deep learning has been successfully applied to multimodal representation learning problems, with a common strategy to learning joint representations that are shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, there still remains a question how to learn a good association between data modalities; in particular, a good generative model of multimodal data should be able to reason about missing data modality given the rest of data modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide a theoretical insight why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend to deep networks with recurrent encoding structure to finetune the whole network. In experiments, we demonstrate the state-of-the-art visual recognition performance on MIR-Flickr database and PASCAL VOC 2007 database with and without text features.

Why it’s cool: Multi-modal data is commonplace in computational biology, so I’m all for models that incorporate it. What made this paper stand out for me was that they did not try to learn the model parameters by maximizing the likelihood of the data.

Tuesday’s poster: Nonparametric Bayesian inference on multivariate exponential families [Tue25]

We develop a model by choosing the maximum entropy distribution from the set of models satisfying certain smoothness and independence criteria; we show that inference on this model generalizes local kernel estimation to the context of Bayesian inference on stochastic processes. Our model enables Bayesian inference in contexts when standard techniques like Gaussian process inference are too expensive to apply. Exact inference on our model is possible for any likelihood function from the exponential family. Inference is then highly efficient, requiring only O(log N) time and O(N) space at run time. We demonstrate our algorithm on several problems and show quantifiable improvement in both speed and performance relative to models based on the Gaussian process.

Why it’s cool: They promise to deliver a really big reward: exact inference for *any* likelihood function in the exponential family. Kernel density estimation for non-parametric Bayesian models? Far out.

Wednesday’s poster: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization [Wed39]

A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.

Why it’s cool: Great work by Yann Dauphin and Razvan Pascanu, in conjunction with Surya Ganguli. They used arguments from statistical physics, and some neat experiments, to demonstrate that what people originally thought to be bad local minima in the search for deep-network parameters are actually saddle points that are hard to escape. They propose a modified quasi-Newton method that works well on large autoencoders. Even better, their code will soon be released in Theano. Maybe my favourite poster.
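
The core trick, as I understand it, is to rescale the Newton step by the absolute values of the Hessian's eigenvalues, so that negative-curvature directions repel the iterate instead of attracting it. A toy numpy sketch of that idea (my own illustration, not the authors' code):

```python
import numpy as np

# On f(x, y) = x^2 - y^2 the origin is a saddle. A plain Newton step jumps
# straight onto it; rescaling by the *absolute* eigenvalues of the Hessian
# instead pushes the iterate away along the negative-curvature direction.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0], [0.0, -2.0]])  # constant Hessian of f

p0 = np.array([1.0, 0.1])

# Plain Newton: p - H^{-1} g lands exactly on the saddle at (0, 0).
newton = p0 - np.linalg.solve(H, grad(p0))

# Saddle-free Newton: use |H|, built from the absolute eigenvalues.
w, V = np.linalg.eigh(H)
H_abs = V @ np.diag(np.abs(w)) @ V.T
sfn = p0 - np.linalg.solve(H_abs, grad(p0))

print(newton)  # [0. 0.]   -> stuck at the saddle
print(sfn)     # [0.  0.2] -> x solved, y pushed *away* from the saddle
```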

Thursday’s poster: Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models [Thu63]

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.

Why it’s cool: Distributed computation for GPs and GPLVMs. Their key insight is that, conditioned on the inducing points, the observations decouple, so p(F_i | X, u) and p(Y_i | F_i) can be evaluated in parallel for every i.
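
That decoupling is easy to demonstrate on a toy example. The sketch below is my own construction (not the authors' code): given inducing inputs Z and values u, each likelihood term depends only on data point i, so a parallel "map" over points followed by a "reduce" (a sum) matches the direct vectorized computation:

```python
import numpy as np

def k(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # squared-exp kernel

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=20)
Y = np.sin(X) + 0.1 * rng.standard_normal(20)
Z = np.linspace(-3, 3, 5)  # inducing inputs
u = np.sin(Z)              # stand-in inducing values
noise = 0.1

Kzz_inv = np.linalg.inv(k(Z, Z) + 1e-8 * np.eye(len(Z)))

def term(x_i, y_i):
    # conditional mean of f_i given u; depends on point i alone
    m_i = (k(np.array([x_i]), Z) @ Kzz_inv @ u).item()
    return -0.5 * (y_i - m_i) ** 2 / noise ** 2

# "map" over points (could be farmed out to separate nodes), then "reduce"
parallel = sum(term(x, y) for x, y in zip(X, Y))

# the same quantity in one vectorized shot
m = k(X, Z) @ Kzz_inv @ u
direct = float(-0.5 * np.sum((Y - m) ** 2) / noise ** 2)

print(parallel, direct)
```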

That’s all for now. If you enjoy Montreal, and are thinking of attending NIPS next year, then rejoice; it will be located in Montreal again. The Palais des Congrès was a bright, spacious, easily accessible venue. I can’t wait to return.

Install CellProfilerAnalyst onto a debian / ubuntu workstation

I received an email yesterday from someone asking me for advice on how to install CellProfilerAnalyst (CPA) on a linux workstation. It’s a program that allows you to inspect, manipulate and analyze data from cell images. It accompanies CellProfiler, which is a great tool from the Carpenter lab at the Broad Institute. If you’re doing work that involves fluorescence imaging of cells in some aspect, I recommend taking a look.

I’ve been through this process a few times, and while it’s not a picnic, it’s much easier these days. In the past, installation involved horrendous things like compiling and tuning the ATLAS library (an implementation of BLAS; this takes days on a workstation). Today all the packages you need are available from some combination of your package manager and pip. Here are the steps I used to install CPA.

** Note: the original request was for an Amazon EC2 instance of Ubuntu. Make sure you launch an instance with a contemporary offering of Ubuntu so that all the packages you’ll need are available. I used Debian 7.0 (Wheezy), but something like Ubuntu 14.04 or 12.04 should work. On Ubuntu, I suspect your sources.list will need the Universe / Multiverse repositories enabled. **

  1. Launch the instance, and perform apt-get update; apt-get upgrade to refresh your pre-installed packages.
  2. Install the following packages, based on Option 2 of the CellProfiler install instructions. Those instructions are geared towards installing CellProfiler itself, but for running CPA you should still install: WxPython, Cython, a Java development kit, MySQLdb, NumPy, SciPy, ZMQ, and Matplotlib. The Debian package names for these are: python-zmq, mysql-server, python-numpy, python-scipy, cython, python-wxgtk2.8, java-package, python-matplotlib. The Ubuntu package names should be identical or very similar. You can install them all with apt-get install; apt will probably pull in other package dependencies as well.

  3. An aside concerning the Java install: java-package is a meta-package which takes the latest version of Java from Oracle and creates a .deb package (say, java_foo.deb) for your system to install. If apt does not install the package automatically, you can install it with dpkg -i java_foo.deb.

    It is the only Java package you should have to install (the java-sun-* packages listed in the wiki are deprecated; ignore them). It’s easier to manage this way. The install instructions (step 8 of Option 2) are a bit messy, so I’d recommend starting by installing just java-package, then installing the .deb it produces. Then continue with the remaining prerequisite packages for CPA. If something fails with a Java-related error, try making the suggested changes to configuration files or environment variables. Less is more when you’re adding to $PATH or $LD_LIBRARY_PATH, in my opinion.


  4. Use git to grab the latest version of CPA: git clone https://github.com/CellProfiler/CellProfiler-Analyst.git. If your instance doesn’t have git installed by default, apt-get install git. Then cd into the newly created CellProfiler-Analyst directory.
  5. Use pip to install the remaining required packages: pip install python-bioformats; pip install mock; pip install verlib. Installing python-bioformats should also install a python package called javabridge. If pip is not installed by default, apt-get install python-pip.
  6. python CellProfiler-Analyst.py should start CPA. It may first emit a bunch of text in the terminal about modules being cythonized. As long as you eventually see the CPA logo in a pop-up window, followed by a request to pick a properties file to open, you’ve installed successfully.
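
Before step 6, it can be handy to check that the Python prerequisites from step 2 actually import. Here is a small helper of my own (not part of CPA; it uses importlib from the Python 3 standard library):

```python
import importlib.util

# Report which of the Python modules installed in step 2 are missing.
def missing_modules(names):
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the step-2 packages (wx comes from python-wxgtk2.8,
# zmq from python-zmq, MySQLdb from the MySQL bindings).
required = ["numpy", "scipy", "zmq", "wx", "MySQLdb", "matplotlib", "Cython"]
print(missing_modules(required))  # ideally prints []
```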

Bayesian non-parametric models for number of component estimation: bad?

I’ve been reading several different resources on Bayesian non-parametric models (abbrv. NPB), mostly for clustering. One of the well-communicated benefits of Bayesian non-parametric models for clustering is that you do not have to specify the parametric capacity (for lack of a better term) of the model; inference subsumes this. So I was really surprised to read Peter Orbanz’s critique of choosing Bayesian non-parametric models for their elegant model selection (section 2.9, pp. 19-20). According to him, this is wrong:
An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics:
Bayesian nonparametric mixtures are not a tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.
Why the surprise? Well, in this 2011 NIPS tutorial, given by Yee Whye Teh & Peter Orbanz, one of the stated benefits of NPB models is their elegant approach to model selection! Another introductory review of these models, by Samuel Gershman & David Blei, opens by extolling the virtues of NPB models for model selection in clustering. Why this repudiation of NPB for model selection?
Orbanz develops his argument as follows:
I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number K_n ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of K_n are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely. Hence, the modeling assumption implicit in a DP mixture is that, as n → ∞, we will inevitably (with probability 1) observe an infinite number of clusters.

To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:

(1) K is finite and known, which is the assumption expressed by a finite mixture model of order K.
(2) K is finite but unknown. In this case, the appropriate model would be a finite mixture model of unknown order. This problem should not be modeled by a DP or other infinite mixture.
(3) K is infinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor process.

In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.

What if we really have reason to assume that (2) is adequate? In a classical mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc.
Orbanz goes on to cite a paper by Jeffrey Miller and Matthew Harrison, who show that under certain (but common) conditions, the posterior distributions of Dirichlet process mixture and Pitman-Yor mixture models are not consistent: the posterior will not concentrate on the true number of mixture components. I won’t reproduce their results here; the MathJax support for Evernote is not great on the web interface. They do, however, offer this remark:
We would like to emphasize that this inconsistency should not be viewed as a deficiency of Dirichlet process mixtures, but is simply due to a misapplication of them. As flexible priors on densities, DPMs are superb, and there are strong results showing that in many cases the posterior on the density converges in L1 to the true density at the minimax-optimal rate, up to a logarithmic factor (see Scricciolo (2012), Ghosal (2010) and references therein). Further, Nguyen (2013) has recently shown that the posterior on the mixing distribution converges in the Wasserstein metric to the true mixing distribution. However, these results do not necessarily imply consistency for the number of components, since any mixture can be approximated arbitrarily well in these metrics by another mixture with a larger number of components (for instance, by making the weights of the extra components infinitesimally small).
This seems to be the problem: non-empty but very small superfluous clusters.  This has been studied both in theory and borne out in experiments before.  The authors continue:
This overestimation seems to occur because there are typically a few tiny “extra” clusters. Among researchers using DPMs for clustering, this is an annoyance that is often dealt with by pruning such clusters — that is, by simply ignoring them when calculating statistics such as the number of clusters. It may be possible to obtain consistent estimators in this way, but this remains an open question.
Q: What does this all mean for the rest of us who want to cluster their data without running an outer loop for model selection?
A: I’ll read the Rousseau and Mengersen paper to see what they report, but I think it’s still OK to use NPB to get a good idea of how many components are in a finite mixture, even if Orbanz’s condition (3) does not hold. What matters most for researchers looking to understand their data set is not that an NPB posterior concentrate almost surely on the true number of components, but that it be *mostly* right about the number of components. As long as the superfluous components are few, and their associated clusters small, it’s not a big deal. Using NPB has the extra virtue of forcing users to think harder about aspects of the model (base measures, prior assumptions) that simpler clustering approaches like k-means or mixtures of Gaussians do not.
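
As a concrete illustration of the tiny-extra-clusters phenomenon and the pruning heuristic, here is a sketch using scikit-learn's truncated variational DP mixture. The tooling choice and toy data are mine, nothing from the papers above:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal(-5, 1, size=(300, 1)),
    rng.normal(0, 1, size=(300, 1)),
    rng.normal(5, 1, size=(300, 1)),
])  # the true K is 3

bgm = BayesianGaussianMixture(
    n_components=10,  # deliberately far too many
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Superfluous components get tiny weights; prune them to estimate K.
effective_k = int(np.sum(bgm.weights_ > 0.01))
print(np.round(np.sort(bgm.weights_)[::-1], 3))
print(effective_k)  # typically 3 for data this well separated
```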

Debian wheezy swig2.0-examples package missing configure file

This post is esoteric, but in the off-chance someone else needs it, hopefully google brought you here.

I’m looking into using SWIG to wrap up a C++ app in Python. How to do this for anything larger than a one-class or one-file code base is far from clear. So I tried installing the swig2.0-examples Debian package for Wheezy. My progress went like this:

  1. cd into /usr/share/doc/swig2.0-examples/python/import
  2. sudo make
  3. It doesn’t work

All sorts of compiler/make complaints about undefined symbols such as @CXX@. What’s up with that? It turns out the Makefile depends on a master Makefile in /usr/share/doc/swig2.0-examples/, which is missing. There *is* an autoconf-generated Makefile.in, but no configure script to define the flags for the given host system. Downloading the source package and running the configure script included therein will generate the Makefiles needed:

$ wget http://ftp.de.debian.org/debian/pool/main/s/swig2.0/swig2.0_2.0.7.orig.tar.gz
$ gunzip swig2.0_2.0.7.orig.tar.gz && tar xvf swig2.0_2.0.7.orig.tar
$ cd swig-2.0.7 && ./configure
$ sudo cp Makefile /usr/share/doc/swig2.0-examples

Et voila, the build now succeeds:

lee@beehive:/usr/share/doc/swig2.0-examples/python/import$ sudo make
make -f ../../Makefile SWIG='swig' SWIGOPT='' \
LIBS='' TARGET='base' INTERFACE='base.i' python_cpp
make[1]: Entering directory `/usr/share/doc/swig2.0-examples/python/import'
swig -python -c++ base.i
g++ -c -fpic base_wrap.cxx -I/usr/include/python2.7 -I/usr/lib/python2.7/config
g++ -shared base_wrap.o -o _base.so
make[1]: Leaving directory `/usr/share/doc/swig2.0-examples/python/import'
make -f ../../Makefile SWIG='swig' SWIGOPT='' \
LIBS='' TARGET='foo' INTERFACE='foo.i' python_cpp
make[1]: Entering directory `/usr/share/doc/swig2.0-examples/python/import'
swig -python -c++ foo.i
g++ -c -fpic foo_wrap.cxx -I/usr/include/python2.7 -I/usr/lib/python2.7/config
g++ -shared foo_wrap.o -o _foo.so
make[1]: Leaving directory `/usr/share/doc/swig2.0-examples/python/import'
make -f ../../Makefile SWIG='swig' SWIGOPT='' \
LIBS='' TARGET='bar' INTERFACE='bar.i' python_cpp
make[1]: Entering directory `/usr/share/doc/swig2.0-examples/python/import'
swig -python -c++ bar.i
g++ -c -fpic bar_wrap.cxx -I/usr/include/python2.7 -I/usr/lib/python2.7/config
g++ -shared bar_wrap.o -o _bar.so
make[1]: Leaving directory `/usr/share/doc/swig2.0-examples/python/import'
make -f ../../Makefile SWIG='swig' SWIGOPT='' \
LIBS='' TARGET='spam' INTERFACE='spam.i' python_cpp
make[1]: Entering directory `/usr/share/doc/swig2.0-examples/python/import'
swig -python -c++ spam.i
g++ -c -fpic spam_wrap.cxx -I/usr/include/python2.7 -I/usr/lib/python2.7/config
g++ -shared spam_wrap.o -o _spam.so
make[1]: Leaving directory `/usr/share/doc/swig2.0-examples/python/import'

The Bit-Bucket Challenge

The ALS Ice Bucket Challenge is sweeping the internet, and people are busy coming up with crafty and audacious ways to dump icy water on their heads. I’d like to propose a lesser challenge, one that would help people who work with data everywhere. I call it the bit-bucket challenge. In the words of Godard:

it’s not important where you steal from, it’s where you bring it to

So, here’s my proposal:

ML researchers of the world: please, please, please release your code under a permissive (BSD, MIT, maybe LGPL) license.

No need to make a video, no need to agonize over the fine line between slacktivist narcissism and do-goodery. None of that. This is *much* easier. Here’s how to do it:

  1. Sign up on GitHub, or bit-bucket, or source-forge, or make your own webpage. Just find somewhere to host your project.
  2. Take the readme.txt you stuck in the main directory of your project and mark it up in Markdown. It should tell people how to build your code and note any obvious stumbling points. If you haven’t written one yet, take this time to do so.
  3. Upload all of that onto the internet (see step 1).
  4. Make a blog post or a tweet briefly describing what you’ve just posted.
  5. Tag three colleagues in your post, challenging them to release a project that they’ve published as a paper.

That’s it.

Is your code ugly? Full of hacks? Hard coded values? Bubble sort? Do not let this deter you; if your code is the least bit functional, someone who has read your paper and would like to try your algorithm will take that mess and improve it. They will patch it, refactor it or make other radical changes to suit their purpose. Chances are good that they’ll share these improvements with you, gratis! Heck, they might even write new tests for your code (you have some tests, right?). But they can’t do that unless you give them the chance.

Why is this a worthwhile challenge? Well, re-implementing a scientific paper is hard. Also, software is now firmly entrenched in the scientific process. If you’re doing computational method development and just writing about your results, I don’t think you’re doing it properly.

“if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do.” (V. Stodden, The scientific method in practice)

(h/t to Gael Varoquaux)

Worried that no one will care about your code? You’re in good company. There are plenty of projects out there not used by anyone other than the author, or algorithms of marginal value. Believe me, I’ve written plenty. But on the off-chance that someone *does* find your code useful, you’ve just saved this person many, many hours of work in trying to re-implement your algorithm from scratch.

So please, dear friends, let’s all share some code. Just do it. Do it for the betterment of us all.

quick reminder post: pip install prefix

Say you want to install a Python package using pip, but you don’t have root or sudo privileges, so you can’t write to the site-packages directory, and instead want to install under some directory below ~. For some reason, the pip man page doesn’t give the syntax for this, so I’ll leave it here:

pip install --install-option="--prefix=/path/to/prefix/dir" foo

This works nicely on SciNet, in conjunction with their advice about setting up a local python package directory.
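
One gotcha worth noting: after installing into a custom prefix, the interpreter won’t find those packages until the prefix’s site-packages directory is on sys.path (via PYTHONPATH, or programmatically). A sketch, with a made-up prefix:

```python
import os
import site
import sys

# After `pip install --install-option="--prefix=..."`, packages land in
# <prefix>/lib/pythonX.Y/site-packages, which Python does not search by
# default. Besides exporting PYTHONPATH, you can add it programmatically
# with site.addsitedir. The prefix below is a made-up example.
prefix = os.path.expanduser("~/local")
pkgs = os.path.join(
    prefix, "lib", "python%d.%d" % sys.version_info[:2], "site-packages"
)
site.addsitedir(pkgs)  # also processes any .pth files in that directory
print(pkgs in sys.path)
```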