# Bayesian non-parametric models for number of component estimation: bad?

I’ve been reading several different resources on Bayesian non-parametric models (abbrv. NPB), mostly for clustering.  One of the well-communicated benefits of Bayesian non-parametric models for clustering is that you do not have to specify the parametric capacity (for lack of a better term) of the model; inference of the model subsumes this.  So I was really surprised to read Peter Orbanz critique of choosing Bayesian non-parametric models by reason of elegant model selection (section 2.9, pp 19-20). According to him, this is wrong:
An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics:
Bayesian nonparametric mixtures are not a tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.
Why the surprise?  Well, in this 2011 NIPS tutorial, given by Yee Whye Teh & Peter Orbanz, one of the benefits of NPB models is their elegant approach to model selection!  Another introductory review of these models, by Samuel Gershman & David Blei, opens by extolling the virtuous qualities of NPB models for model selection in clustering. Why this repudiation NPB for model selection?
Orbanz develops his argument as follows:
I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number K_n ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of K_n are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely.  Hence, the modeling assumption implicit in a DP mixture is that
as n → ∞, we will inevitably (with probability 1) observe an infinite number of clusters.

To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:
(1) K is finite and known, which is the assumption expressed by a finite mixture model of order K.
(2) K is finite but unknown. In this case, the appropriate model would be a
finite mixture model of unknown order. This problem should not be modeled
by a DP or other infinite mixture.
(3) K is infinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor
process.
In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.
What if we really have reason to assume that (2) is adequate? In a classical
mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc.
Orbanz goes on to cite a paper by Jeffery Miller and Matthew Harrison, who show that under certain (but common) conditions,  the posterior distributions of Dirichlet Process mixture and Pitman-Yor mixture models are not consistent; the posterior will not concentrate on the true number of mixture components with probability one.  I won’t reproduce their results here, the MathJax support for Evernote is not great on the web interface.  They do, however, offer this remark:
We would like to emphasize that this inconsistency should not be viewed as a deficiency of Dirichlet process mixtures, but is simply due to a misapplication of them. As flexible priors on densities, DPMs are superb, and there are strong results showing that in many cases the posterior on the density converges in L1 to the true density at the minimax-optimal rate, up to a logarithmic factor (see Scricciolo (2012), Ghosal (2010) and references therein). Further, Nguyen (2013) has recently shown that the posterior on the mixing distribution converges in the Wasserstein metric to the true mixing distribution. However, these results do not necessarily imply consistency for the number of components, since any mixture can be approximated arbitrarily well in these metrics by another mixture with a larger number of components (for instance, by making the weights of the extra components infinitesimally small).
This seems to be the problem: non-empty but very small superfluous clusters.  This has been studied both in theory and borne out in experiments before.  The authors continue:
This overestimation seems to occur because there are typically a few tiny “extra” clusters.  Among researchers using DPMs for clustering, this is an annoyance that is often dealt with by pruning such clusters — that is, by simply ignoring them when calculating statistics such as the number of clusters. It may be possible to obtain consistent estimators in this way, but this remains an open question
Q: What does this all mean for the rest of us who want to cluster their data without running an outer loop for model selection?
A: I’ll read the Rousseau and Mengersen paper to see what they report, but I think that it’s still ok to use NPB to get a good idea of how many components are in a finite mixture, even if Orbanz’ condition (3) does not hold.  What’s most important for researchers looking to understand their data set is not that an NPB posterior be concentrated almost surely on the true number of components, but that it be *mostly* right about the number of components.  As long as the number of superfluous components is small, and their associated clusters also small, it’s not a big deal.  Using NPB has the extra virtue of forcing users to  think harder about aspects of the model (specifying base measures, prior assumptions), which simpler clustering models like k-means or mixture of Gaussians do not.

# A belated MLSB 2013 retrospective

MLSB is a special interest group meeting that is co-located with ISMB, the Intelligent Systems for Molecular Biology conference. MLSB is devoted to Machine Learning research in the area of Systems Biology. It brings together machine learning folks, statisticians, geneticists, and biologists; it was a great place to strike up discussion about how to do better analysis on biological data. The full program is available online, so here I’m going to write up a few highlights for each day.

Friday

Markus Heinonen gave a solid talk about time dependent GP regression and significance testing for sparse time series. His group was examining the behaviour of cells under irradiation: how does this treatment affect both diseased and healthy cells with respect to gene regulation, cell processes, and the development of radiation disease. They pose this as trying to determine which genes / proteins are differentially expressed in irradiated cells over a time course. He applied GP regression to time series, using significance tests for sparse time series to determine significant expression (or changes). A nice piece of work, but they didn’t provide a citation since it was probably not yet accepted. But I found a link via google scholar.

Lani Wu gave a good invited talk about two recent stories from their lab involving neutrophil locomotion: one about perturbation analysis, and the other about cellular heterogeneity. There wasn’t anything really new in their methods, but what made an impression was the thoroughness of their analysis. Their lab has a great record of turning out quality work. As a fellow high throughput image biologist, who knows all the pain involved when trying to extract meaning from images (which are usually generated in a really noisy way), I was impressed 🙂

Quaid Morris gave a concise and informative talk, presenting work by Shankar that extended Sara Mostafavi‘s work on 3Prop. It addresses function prediction where the input is a set of heterogeneous networks, as well as a group of genes, and the desired outputs are labels for the set of query genes. Their method is called LMGraph, and it incorporates feature information into the function prediction algorithm. They extract discriminative features from all available networks, and combine them via a simple weighting scheme. This has nice properties: it’s highly scalable, using efficient linear classifiers, and it combines both network with feature data to learn interpretable models.

Saturday

Johannes Stephan gave a cool talk about Mixed Random Forests. They developed a new model intended for GWAS data, to try and detect functional relationship between phenotype and genotype while accounting for confounding factors due to population structure.

The standard approach to addressing population structure is to use a linear mixed model:
$y = x\beta + \mu + \epsilon$
$\mu \sim N(0,\sigma^{2}_{g}K)$ where K is the structured covariance matrix, $\epsilon \sim N(0, \sigma^{2}_{v}I)$
Then do a log ratio test on Beta coefficients.

A more difficult model is one where many SNPs that have interaction effects which are non-linear
$f_s(X) = X_1\beta_1 + X_2\beta_2 + \ldots$

Their idea is to combine linear mixed models with tree based models to infer $f_s(X)$ in the presence of the population effect. The nice thing about growing trees is that the method is fast, but are unstable with regards to small changes in input (highly discontinuous). This instability can be mitigated by subsampling and building many trees (RFs). The take home message is to use RF for feature selection, they use linear mixed models for prediction.

Finally, John Marioni gave a really interesting technical talk about the challenges in single cell RNA sequencing. The big idea is: multiple cells from a population -> single cell RNA-seq -> correlate patterns across cells -> infer cell type.

The big challenges remaining:

• quantify variability, identify highly variable genes
• visualization and clustering
• downstream interpretation

John described several technical aspects of experiments in single cell RNA-seq. Long story short: technical variability is still a problem, e.g count value ranges quickly increase when technical noise is large. One example nice bit of work was their work to estimate kinetic model parameters for stochastic models of gene expression from single cell RNA expression data. This talk again didn’t contain anything so new and shiny, but the quality and thoroughness of the work presented that captured my attention. This was my first exposure to real data from single cell RNA seq experiments, which alone was enough to impress me.