An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics:Bayesian nonparametric mixtures arenota tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.

I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number K_n ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of K_n are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely. Hence, the modeling assumption implicit in a DP mixture is thatas n → ∞, we will inevitably (with probability 1) observe an infinite number ofclusters.To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:(1) K isfinite and known, which is the assumption expressed by a finite mixture model of order K.

(2) K isfinite but unknown. In this case, the appropriate model would be a

finite mixture model of unknown order. This problem should not be modeled

by a DP or other infinite mixture.

(3) K isinfinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor

process.In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.What if we really have reason to assume that (2) is adequate? In a classical

mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc.

We would like to emphasize that this inconsistency should not be viewed as a deficiency of Dirichlet process mixtures, but is simply due to a misapplication of them. As flexible priors on densities, DPMs are superb, and there are strong results showing that in many cases the posterior on the density converges in L1 to the true density at the minimax-optimal rate, up to a logarithmic factor (see Scricciolo (2012), Ghosal (2010) and references therein). Further, Nguyen (2013) has recently shown that the posterior on the mixing distribution converges in the Wasserstein metric to the true mixing distribution. However, these results do not necessarily imply consistency for the number of components, sinceany mixture can be approximated arbitrarily wellin these metrics by another mixture with a larger number of components(for instance, by making the weights of the extra components infinitesimally small).

This overestimation seems to occur because there are typically a few tiny “extra” clusters. Among researchers using DPMs for clustering, this is an annoyance that is often dealt with by pruning such clusters — that is, by simply ignoring them when calculating statistics such as the number of clusters. It may be possible to obtain consistent estimators in this way, but this remains an open question

**Q:**What does this all mean for the rest of us who want to cluster their data without running an outer loop for model selection?

**A:**I’ll read the Rousseau and Mengersen paper to see what they report, but I think that it’s still ok to use NPB to get a good idea of how many components are in a finite mixture, even if Orbanz’ condition (3) does not hold. What’s most important for researchers looking to understand their data set is not that an NPB posterior be concentrated almost surely on the true number of components, but that it be *mostly* right about the number of components. As long as the number of superfluous components is small, and their associated clusters also small, it’s not a big deal. Using NPB has the extra virtue of forcing users to think harder about aspects of the model (specifying base measures, prior assumptions), which simpler clustering models like k-means or mixture of Gaussians do not.