Can the EMD work for me?

As a finishing touch on the paper we’re currently working on (on high-throughput, image-based screening of a collection of yeast mutants subjected to different treatments, and the identification of interesting morphology), we would like to show that the data set we’ve painstakingly generated over the last year or so is akin to a mineral-rich riverbed, perfect for prospectors to labour in and extract gold.

As a proof of concept, we’ve been looking at differences in nuclear morphology.  My part of this task is to come up with a quick test to find ORFs whose nuclei display abnormal morphology.  At first pass, I thought that we could examine how the distribution of a surrogate for nuclear shape differs between members of the deletion collection.  Members that score highly on such a measure (e.g. the KL-divergence) would be likely candidates for further inspection.

But this might be too restrictive.  After all, we’ve measured 10 different area and shape features for nuclei, so why not use more of them?  The KL-divergence is defined for multivariate distributions too, but estimating it reliably from samples in 10 dimensions is awkward, so that’s out.  But I happened upon something analogous called the Wasserstein-1 distance, also known as the Earth Mover’s Distance.  From Wikipedia:

In probability theory, the earth mover’s distance (EMD) is a measure of the distance between two probability distributions over a region D. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of dirt over the region D, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of dirt moved times the distance by which it is moved.

As a bonus, an R package that computes it efficiently is available via R-Forge.  Now, some care has to be taken (see the link to the definition posted above), but even when some of these conditions are not satisfied, it should still function as a measure of (dis)similarity between two multivariate distributions.  And since the data I’m working with has only 10 dimensions, going through the whole deletion collection is a snap.  Hopefully it can produce sensible results.
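The screening idea can be sketched in a few lines. This is a toy illustration in Python, not the R package's interface: the function `emd_samples`, the strain names and the 2-D feature values are all made up. It assumes every strain contributes the same number of equally weighted cells, in which case the EMD reduces to the cheapest one-to-one pairing of points, found here by brute force over all pairings. That is only feasible for a handful of points; real data would need a proper transport solver.

```python
from itertools import permutations
from math import dist


def emd_samples(a, b):
    """Exact EMD between two equal-size point clouds with uniform weights.

    Brute force over every pairing, so only usable for very small samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch needs equally sized samples")
    n = len(a)
    best = min(
        sum(dist(a[i], b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n  # each point carries 1/n of the total mass


# Made-up 2-D feature vectors per cell; hypothetical, not real measurements.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
strains = {
    "mutant_a": [(0.0, 0.1), (1.0, 0.1), (0.0, 1.1)],  # barely shifted
    "mutant_b": [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0)],  # strongly shifted
}

# Rank strains from most to least different from the reference.
scores = {name: emd_samples(reference, cells) for name, cells in strains.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the ground distance here is Euclidean, features measured on very different scales would probably need standardising first so that no single feature dominates the score.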
