Latent Semantic Indexing


Maciej Ceglowski is full of ideas on how [Latent Semantic Indexing] can be used: "It's good on large collections of text, written in a formal style: libraries of academic research, for example," he says. Biology, he points out, can benefit massively from LSI. The SVD algorithm doesn't actually require text at all. There is no language understanding, just a count of word frequency. If you take mass spectrographs of complex molecules, and treat each molecule as a document, and each peak on the spectrograph as a word, you can build searchable indexes in just the same way you can with text.

This could be revolutionary for medical science. By posting an entire text document into the search box, an LSI system will give you back a list of similar documents: a sort of "More Like This" search. While for text this is useful, it becomes revolutionary when applied to proteins. From a database of thousands of molecules, you can use LSI to find similar matches. You can find clusters - places on the matrix where the molecules are similar - and you might even find similarities you didn't know about.


[via Ben Hammersley.com]


And the article never even mentions DNA codons (although Maciej does cover them in a comprehensive overview of the subject). I've been meaning to give this a try (as a literature search engine) for ages - I think PubMed Central might be able to supply a big enough stack of full-text articles to make the index usable.
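For anyone else tempted to try it, here is a rough sketch of the "More Like This" idea in Python with NumPy. It is not Maciej's implementation, just the textbook recipe as I understand it: count words into a document-by-term matrix, take a truncated SVD, project the query into the same reduced space, and rank documents by cosine similarity. The four toy documents, the function names (`tokenize`, `build_index`, `more_like_this`) and the choice of `k=2` dimensions are all made up for illustration. A real index would want term weighting and far more text. For spectra, the rows would be molecules and the columns peak positions rather than words.

```python
import re
import numpy as np

# Toy corpus, invented for illustration only.
DOCS = [
    "protein folding and protein structure prediction",
    "structure of membrane protein complexes",
    "gene expression in yeast cells",
    "yeast gene regulation and expression",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs, k=2):
    """Build a rank-k LSI index from raw word counts (no weighting)."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    col = {w: i for i, w in enumerate(vocab)}
    # Rows are documents, columns are terms.
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in tokenize(doc):
            counts[row, col[w]] += 1
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    # Document coordinates in the k-dimensional latent space.
    doc_vecs = U[:, :k] * s[:k]
    return col, Vt[:k], doc_vecs

def more_like_this(query, index, top=3):
    """Return (document number, cosine similarity) pairs, best first."""
    col, Vt_k, doc_vecs = index
    q = np.zeros(Vt_k.shape[1])
    for w in tokenize(query):
        if w in col:  # words the index has never seen are ignored
            q[col[w]] += 1
    # Project the query's word counts into the same latent space.
    q_vec = Vt_k @ q
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    sims = doc_vecs @ q_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)[:top]
    return [(int(i), float(sims[i])) for i in order]

index = build_index(DOCS, k=2)
hits = more_like_this("protein structure", index, top=2)
```

Pasting a whole document into `more_like_this` works the same way as a short query, which is all the "More Like This" search amounts to in this sketch.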