Extracting Knowledge from Biomedical Text

HubMed's Tag Storage lets users store statements in the form

[noun phrase] … verb … [noun phrase]

for example:

[DSS1] binds to [BRCA2].

Storing statements this way means that you can later see everything that has been said about a particular subject, for example DSS1.
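To make the idea concrete, here is a minimal sketch of storing such statements and looking them up by subject. It is not how HubMed's Tag Storage is implemented; the names `Statement` and `StatementStore` are made up for illustration, and I'm assuming a statement should be findable by either of its two noun phrases.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    """One [noun phrase] verb [noun phrase] statement."""
    subject: str
    verb: str
    obj: str


class StatementStore:
    """Hypothetical in-memory store, indexed by noun phrase (case-insensitive)."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, subject, verb, obj):
        statement = Statement(subject, verb, obj)
        # Index under both noun phrases; the set avoids a double entry
        # when subject and object happen to be the same phrase.
        for key in {subject.lower(), obj.lower()}:
            self._index[key].append(statement)
        return statement

    def about(self, noun_phrase):
        """Return every stored statement that mentions this exact noun phrase."""
        return list(self._index.get(noun_phrase.lower(), []))


store = StatementStore()
store.add("DSS1", "binds to", "BRCA2")
for s in store.about("DSS1"):
    print(f"[{s.subject}] {s.verb} [{s.obj}].")
```

Matching on the exact phrase is a deliberate simplification: a longer noun phrase that merely contains "DSS1" would not be found under "DSS1" here.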

What I wanted to do was semi-automate the process of extracting information from the text of an abstract and turning it into statements. There has been lots of work on identifying gene and protein names, and thus deducing their interactions, from biomedical literature. Services such as Whatizit and Termino make it easy to identify gene and protein names (by comparing the text to a database of known nouns), while MMTx discovers UMLS Metathesaurus concepts in text. Ideally, though, a parser would pick out all the phrases in which things could be said to interact.
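The "compare the text to a database of known nouns" idea can be sketched in a few lines. This is only an illustration of dictionary matching, not how Whatizit or Termino work internally; `KNOWN_NAMES` is a tiny made-up stand-in for a real database, and `build_matcher` and `find_names` are hypothetical helper names.

```python
import re

# Hypothetical stand-in for a real database of gene and protein names.
KNOWN_NAMES = ["DSS1", "BRCA2"]


def build_matcher(names):
    """Compile one regex that matches any known name as a whole token."""
    # Longest names first, so a longer name wins over a shorter prefix of it.
    alternatives = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in alternatives)
    # The lookarounds stop "DSS1" from matching inside e.g. "DSS10".
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")


def find_names(text, matcher):
    """Return (start, end, name) for each known name found in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


matcher = build_matcher(KNOWN_NAMES)
sentence = "We found that essentially all BRCA2 in human cell lines is associated with DSS1."
for start, end, name in find_names(sentence, matcher):
    print(start, end, name)
```

A lookup like this finds the names but says nothing about how they relate to each other, which is why a parser is still needed.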

So far, I'm using MedPost (which is like ePost, but trained on a biomedical corpus) to tag the parts of speech (i.e. nouns, adjectives, verbs, etc.) and then a fairly nasty regular expression to capture the phrases. The result is far from perfect, but you can get results like:

* [DSS1] is an evolutionarily conserved [acidic protein].
* [Essentially all BRCA2 in human cell lines] is associated with [DSS1].
* [DSS1 depletion] also led to [hypersensitivity to DNA damage].
* [The stability of BRCA2 protein in mammalian cells] depends on [the presence of DSS1].

from an abstract:
DSS1 is an evolutionarily conserved acidic protein that binds to BRCA2. However, study of the function of DSS1 in mammalian cells has been hampered because endogenous DSS1 has not been detectable by Western blotting. Here, we developed a modified Western blotting protocol that detects endogenous DSS1 protein, and used it to study the function of DSS1 and its interaction with BRCA2 in mammalian cells. We found that essentially all BRCA2 in human cell lines is associated with DSS1. Importantly, we found that RNAi knockdown of DSS1 in human cell lines led to dramatic loss of BRCA2 protein, mainly due to its increased degradation. Furthermore, the stability of BRCA2 mutant devoid of the DSS1-binding domain is unaffected by the depletion of DSS1. Most notably, like BRCA2 depletion, DSS1 depletion also led to hypersensitivity to DNA damage. These results demonstrated that the stability of BRCA2 protein in mammalian cells depends on the presence of DSS1.
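The tag-then-regex approach can be sketched roughly as follows. This is not the actual regular expression used in HubMed, and the tags below are Penn Treebank-style tags written by hand rather than real MedPost output. The trick assumed here is to collapse each tag to a single letter, so that the regular expression runs over the sequence of tags instead of the words; `tag_class`, `NP`, `VG` and `extract` are names invented for this sketch.

```python
import re


def tag_class(tag):
    """Collapse a Penn Treebank-style tag to a one-letter class."""
    if tag.startswith("NN"):
        return "N"  # noun
    if tag.startswith("VB"):
        return "V"  # verb
    if tag.startswith("JJ"):
        return "J"  # adjective
    if tag.startswith("RB"):
        return "R"  # adverb
    if tag == "DT":
        return "D"  # determiner
    if tag in ("IN", "TO"):
        return "I"  # preposition
    return "O"      # anything else


# Noun phrase: optional adverb and determiner, modifiers, nouns, then any
# number of attached prepositional phrases ("of BRCA2 protein", ...).
NP = r"R?D?[JR]*N+(?:ID?[JR]*N+)*"
# Verb group: optional adverb, one or more verbs, optional preposition.
VG = r"R?V+I?"
PATTERN = re.compile(rf"({NP})({VG})({NP})")


def extract(tagged):
    """Return (noun phrase, verb, noun phrase) triples from (word, tag) pairs."""
    words = [word for word, _ in tagged]
    # One letter per token, so match offsets are also token offsets.
    classes = "".join(tag_class(tag) for _, tag in tagged)
    triples = []
    for m in PATTERN.finditer(classes):
        triples.append(tuple(
            " ".join(words[m.start(g):m.end(g)]) for g in (1, 2, 3)
        ))
    return triples


# Hand-tagged example (an assumption, not tagger output).
tagged = [("DSS1", "NN"), ("depletion", "NN"), ("also", "RB"), ("led", "VBD"),
          ("to", "TO"), ("hypersensitivity", "NN"), ("to", "TO"),
          ("DNA", "NN"), ("damage", "NN")]
for subject, verb, obj in extract(tagged):
    print(f"[{subject}] {verb} [{obj}].")
```

Because the pattern expects a noun phrase, then a verb group, then another noun phrase, it can only pick up statements written in exactly that order.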

The problem is that it only recognises statements written in a very simple order. Being more sophisticated requires either a more complicated parser, which takes a lot more time and processing power, or human intervention.

I've added this to HubMed for now, anyway: click the 'Tag' link under an abstract, then 'suggest annotations'.

See also: GeneWays, which seems to be the most sophisticated parser around, but it's apparently not open for everyone to use.

Update: Added clickable noun phrases and verbs in the abstract, highlighted with coloured boxes.