Everyone who works in bioinformatics has encountered this problem at some point: given a set of genes, cross-reference each entry against a collection of other databases and aggregate the query results as evidence to support conclusions drawn from an experiment. Common instances of this problem include analyzing high-throughput screening data for significant compounds or genes, performing gene set enrichment analyses, and searching for GO term enrichment. The big problem here is nomenclature. Every database has its own ID for a gene/protein/ORF, so your gene list may use one form of identifier, while the information you need is probably indexed in a database under another.
For example, the gene RAD52 is (according to SGD) known both by its standard name (RAD52) and its systematic name (YML032C). Yet it is also known as UPI00001683C2 and UPI0000052F95 (UniParc – UniProt); AAA50352.1, AAT93163.1, CAA86623.1, and DAA09866.1 (EMBL); NP_013680.2 (RefSeq); and P06778-1 and P06778.2 (SwissProt – UniProt). There are web applications that help with searches like these. Two that do a good job are EBI’s Protein Identifier Cross-Reference service and UniProt’s ID mapper.
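To make the aliasing problem concrete, here is a minimal sketch in Python: a reverse lookup table that collapses every identifier above (all taken from the RAD52 example) to the one standard name. Real mapping tables would of course be built from a database dump or an API rather than hard-coded, and the `canonical` helper is a hypothetical name for illustration.

```python
# All of these identifiers refer to the same gene, RAD52
# (aliases as listed by SGD, UniParc, EMBL, RefSeq, and SwissProt).
RAD52_ALIASES = [
    "RAD52", "YML032C",                                       # SGD standard / systematic
    "UPI00001683C2", "UPI0000052F95",                         # UniParc
    "AAA50352.1", "AAT93163.1", "CAA86623.1", "DAA09866.1",   # EMBL
    "NP_013680.2",                                            # RefSeq
    "P06778-1", "P06778.2",                                   # SwissProt
]

# Reverse lookup: any alias -> the canonical standard name.
alias_to_standard = {alias: "RAD52" for alias in RAD52_ALIASES}

def canonical(identifier: str) -> str:
    """Resolve a known alias to its standard gene name.

    Unknown identifiers pass through unchanged so they can be
    flagged downstream instead of silently disappearing.
    """
    return alias_to_standard.get(identifier, identifier)
```

The pass-through behavior for unknown IDs is deliberate: silently dropping unrecognized identifiers is exactly the kind of mistake that later has to be hunted down by hand.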
Still, each time I have had to solve this type of problem, there have been mistakes that must be corrected manually, as well as post-processing on the returned table or matrix to fix formatting before the analysis can even begin. It seems like a problem where the minimum cost is still quite high for each instance, especially since the problem sizes are commonly on the order of hundreds of genes. I wonder how much time has been collectively spent on this problem by all the bioinformatics researchers in the world. Also, does anyone out there have a better way to handle these types of jobs?
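One small habit that has cut down my manual clean-up is to make the mapping step report its failures explicitly, so that hand-correction is confined to the unmapped remainder instead of a visual scan of the whole table. A sketch, assuming the mapping table has already been obtained from whatever service you use (the `map_ids` function and the toy table below are my own illustration, not part of any tool mentioned above):

```python
def map_ids(queries, mapping):
    """Map each query ID through `mapping`.

    Returns (mapped, unmapped): a dict of successful translations
    and a list of IDs that need manual attention, in input order.
    """
    mapped = {}
    unmapped = []
    for query in queries:
        if query in mapping:
            mapped[query] = mapping[query]
        else:
            unmapped.append(query)
    return mapped, unmapped


# Toy mapping table standing in for a real ID-mapper result.
table = {"YML032C": "RAD52", "YDL029W": "ARP2"}

mapped, unmapped = map_ids(["YML032C", "YDL029W", "YXX999Z"], table)
```

Here `mapped` holds the two resolvable systematic names and `unmapped` contains only `"YXX999Z"`, so the manual-correction pass starts from a short, explicit worklist rather than the full gene set.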