A belated MLSB 2013 retrospective

MLSB is a special interest group meeting that is co-located with ISMB, the Intelligent Systems for Molecular Biology conference. MLSB is devoted to Machine Learning research in the area of Systems Biology. It brings together machine learning folks, statisticians, geneticists, and biologists; it was a great place to strike up discussion about how to do better analysis on biological data. The full program is available online, so here I’m going to write up a few highlights for each day.


Markus Heinonen gave a solid talk about time dependent GP regression and significance testing for sparse time series. His group was examining the behaviour of cells under irradiation: how does this treatment affect both diseased and healthy cells with respect to gene regulation, cell processes, and the development of radiation disease. They pose this as trying to determine which genes / proteins are differentially expressed in irradiated cells over a time course. He applied GP regression to time series, using significance tests for sparse time series to determine significant expression (or changes). A nice piece of work, but they didn’t provide a citation since it was probably not yet accepted. But I found a link via google scholar.

Lani Wu gave a good invited talk about two recent stories from their lab involving neutrophil locomotion: one about perturbation analysis, and the other about cellular heterogeneity. There wasn’t anything really new in their methods, but what made an impression was the thoroughness of their analysis. Their lab has a great record of turning out quality work. As a fellow high throughput image biologist, who knows all the pain involved when trying to extract meaning from images (which are usually generated in a really noisy way), I was impressed 🙂

Quaid Morris gave a concise and informative talk, presenting work by Shankar that extended Sara Mostafavi‘s work on 3Prop. It addresses function prediction where the input is a set of heterogeneous networks, as well as a group of genes, and the desired outputs are labels for the set of query genes. Their method is called LMGraph, and it incorporates feature information into the function prediction algorithm. They extract discriminative features from all available networks, and combine them via a simple weighting scheme. This has nice properties: it’s highly scalable, using efficient linear classifiers, and it combines both network with feature data to learn interpretable models.


Johannes Stephan gave a cool talk about Mixed Random Forests. They developed a new model intended for GWAS data, to try and detect functional relationship between phenotype and genotype while accounting for confounding factors due to population structure.

The standard approach to addressing population structure is to use a linear mixed model:
y = x\beta + \mu + \epsilon
\mu \sim N(0,\sigma^{2}_{g}K) where K is the structured covariance matrix, \epsilon \sim N(0, \sigma^{2}_{v}I)
Then do a log ratio test on Beta coefficients.

A more difficult model is one where many SNPs that have interaction effects which are non-linear
f_s(X) = X_1\beta_1 + X_2\beta_2 + \ldots

Their idea is to combine linear mixed models with tree based models to infer f_s(X) in the presence of the population effect. The nice thing about growing trees is that the method is fast, but are unstable with regards to small changes in input (highly discontinuous). This instability can be mitigated by subsampling and building many trees (RFs). The take home message is to use RF for feature selection, they use linear mixed models for prediction.

Finally, John Marioni gave a really interesting technical talk about the challenges in single cell RNA sequencing. The big idea is: multiple cells from a population -> single cell RNA-seq -> correlate patterns across cells -> infer cell type.

The big challenges remaining:

  • quantify variability, identify highly variable genes
  • visualization and clustering
  • downstream interpretation

John described several technical aspects of experiments in single cell RNA-seq. Long story short: technical variability is still a problem, e.g count value ranges quickly increase when technical noise is large. One example nice bit of work was their work to estimate kinetic model parameters for stochastic models of gene expression from single cell RNA expression data. This talk again didn’t contain anything so new and shiny, but the quality and thoroughness of the work presented that captured my attention. This was my first exposure to real data from single cell RNA seq experiments, which alone was enough to impress me.