I’ve been reading several different resources on Bayesian non-parametric models (abbrv. NPB), mostly for clustering. One of the well-communicated benefits of Bayesian non-parametric models for clustering is that you do not have to specify the parametric capacity (for lack of a better term) of the model; inference of the model subsumes this. So I was really surprised to read Peter Orbanz critique
of choosing Bayesian non-parametric models by reason of elegant model selection (section 2.9, pp 19-20). According to him, this is wrong:
An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics:
Bayesian nonparametric mixtures are not a tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.
Why the surprise? Well, in this
2011 NIPS tutorial, given by Yee Whye Teh & Peter Orbanz, one of the benefits of NPB models is their elegant approach to model selection! Another introductory review
of these models, by Samuel Gershman & David Blei, opens by extolling the virtuous qualities of NPB models for model selection in clustering. Why this repudiation NPB for model selection?
Orbanz develops his argument as follows:
I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number K_n ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of K_n are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely. Hence, the modeling assumption implicit in a DP mixture is that
as n → ∞, we will inevitably (with probability 1) observe an infinite number of clusters.
To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:
(1) K is finite and known, which is the assumption expressed by a finite mixture model of order K.
(2) K is finite but unknown. In this case, the appropriate model would be a
finite mixture model of unknown order. This problem should not be modeled
by a DP or other infinite mixture.
(3) K is infinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor
In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.
What if we really have reason to assume that (2) is adequate? In a classical
mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc.
Orbanz goes on to cite a paper
by Jeffery Miller and Matthew Harrison, who show that under certain (but common) conditions, the posterior distributions of Dirichlet Process mixture and Pitman-Yor mixture models are not consistent; the posterior will not concentrate on the true number of mixture components with probability one. I won’t reproduce their results here, the MathJax support for Evernote is not great on the web interface. They do, however, offer this remark:
We would like to emphasize that this inconsistency should not be viewed as a deficiency of Dirichlet process mixtures, but is simply due to a misapplication of them. As flexible priors on densities, DPMs are superb, and there are strong results showing that in many cases the posterior on the density converges in L1 to the true density at the minimax-optimal rate, up to a logarithmic factor (see Scricciolo (2012), Ghosal (2010) and references therein). Further, Nguyen (2013) has recently shown that the posterior on the mixing distribution converges in the Wasserstein metric to the true mixing distribution. However, these results do not necessarily imply consistency for the number of components, since any mixture can be approximated arbitrarily well in these metrics by another mixture with a larger number of components (for instance, by making the weights of the extra components infinitesimally small).
This seems to be the problem: non-empty but very small superfluous clusters. This has been studied both in theory
and borne out
in experiments before. The authors continue:
This overestimation seems to occur because there are typically a few tiny “extra” clusters. Among researchers using DPMs for clustering, this is an annoyance that is often dealt with by pruning such clusters — that is, by simply ignoring them when calculating statistics such as the number of clusters. It may be possible to obtain consistent estimators in this way, but this remains an open question
Q: What does this all mean for the rest of us who want to cluster their data without running an outer loop for model selection?
A: I’ll read the Rousseau and Mengersen paper to see what they report, but I think that it’s still ok to use NPB to get a good idea of how many components are in a finite mixture, even if Orbanz’ condition (3) does not hold. What’s most important for researchers looking to understand their data set is not that an NPB posterior be concentrated almost surely on the true number of components, but that it be *mostly* right about the number of components. As long as the number of superfluous components is small, and their associated clusters also small, it’s not a big deal. Using NPB has the extra virtue of forcing users to think harder about aspects of the model (specifying base measures, prior assumptions), which simpler clustering models like k-means or mixture of Gaussians do not.