A confession: I started this blog because I was afraid my resume was missing a necessary part: a blog. A research blog, or some other internet presence I could direct would-be employers/colleagues to so that they might learn more about me. Supposedly everyone has one, and if you don’t, you’re at a serious disadvantage. What I don’t yet understand is how everyone has time to update their blogs while actually getting work done. Over the last two weeks, I’ve been working pretty hard on one of my course projects, in preparation for a trade-show style poster presentation. The resulting poster is nice, and I’ve come up with a couple of nice results, but I’ve had little to no time for blogging.
So, currently I’m working on three different things:
1) A project that tries to cluster spam messages by body content and correlate the clusters with botnet data (which is a hot topic these days). It might be really useful if I can get Affinity Propagation working in the analysis pipeline. The pipeline works kind of like this: (Extract message bodies, turn them into transactions) –> (Code the transactions as integers) –> (Mine the transactions for frequent patterns. Kind of like a dimensionality reduction in the bag-of-words vector space model.) –> (Tabulate the feature matrix by parsing the transactions and computing the TF x IDF weights of the messages) –> (Cluster the messages using a standard K-means algorithm. This is repeated several times with random re-assignments to increase the likelihood of reproducible clusters that capture some semantic meaning) –> (Interpret the output). It’s a little frustrating, as I’m working with a huge data set and our computing cluster is a little flaky (disk problems!). Currently, it looks as if K-means is finding meaningful clusters, but it’s too slow to be useful: clustering ~12000 messages on ~10000 features takes about 45 minutes of computing. That’s not fast enough when you consider that one day’s worth of spam is no fewer than 864000 messages, which is 72 times as many. Even if the runtime only grew linearly with the number of messages, that would be about 54 hours of computing per day of spam. Affinity Propagation (http://www.psi.toronto.edu/affinitypropagation/) is an exemplar-based clustering technique that is reportedly much faster than K-means, and it would really speed up the analysis if I can get a Perl-to-C interface up and running. I understand Perl is extended through XS (eXStension?) or SWIG (http://www.swig.org/papers/Perl98/swigperl.htm), though I have no experience with either, and both look non-trivial.
2) An assignment that deals mainly with a new sequencing technology for transcriptome quantification called RNA-Seq (Nature Methods Vol 5 No 7 July 2008). We are asked to extend their methods to a probabilistic model for dealing with multi-reads. I haven’t had much time to devote to this, so my post gets a little sparse here.
3) My own research into noise models for MRM mass-spec protocols. Lots of reading to do here, and a couple of ideas that need more work, but I’ll have to finish my coursework first.
So, that’s it for now. If all goes well, I’ll post some results and insight into the spam-generating process in the next week or so.