Sentence ordering in OTMI

The Biomedical Literature Mining Publications (BLIMP) archive has a fine collection of articles discussing the extraction of information from scientific papers. One in particular, "A baseline feature set for learning rhetorical zones using full articles in the biomedical domain", assesses various methods of identifying classes of statements in papers, dividing them into categories such as 'background information', 'methods', 'experimental results', 'insights' and 'implications'. The authors found that knowing which section of the paper each sentence was in helped the classification (obviously), as did analysing each sentence in the context of its surrounding sentences (a +1/-1 window).
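To make the +1/-1 window concrete, here is a minimal sketch of how features for one sentence might combine its own words, the words of its neighbours and the name of its section. This is not the feature set used in the paper: the function names, the feature naming scheme and the simple tokeniser are all assumptions made for illustration. The point is only that such features cannot be built unless the original sentence order is known.

```python
import re

def tokens(sentence):
    # Crude tokeniser (an assumption): lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", sentence.lower())

def window_features(sentences, sections):
    """Build one feature dict per sentence from a +1/-1 sentence window.

    sentences: sentences in their original document order
    sections:  section name for each sentence (same length as sentences)
    """
    features = []
    for i in range(len(sentences)):
        f = {"section=" + sections[i]: 1}
        for offset, prefix in ((-1, "prev"), (0, "cur"), (1, "next")):
            j = i + offset
            if 0 <= j < len(sentences):
                for tok in tokens(sentences[j]):
                    f[prefix + ":" + tok] = 1
            else:
                # Mark the start or end of the document explicitly.
                f[prefix + ":<none>"] = 1
        features.append(f)
    return features

if __name__ == "__main__":
    sents = ["We studied X.", "Cells were grown.", "Growth increased."]
    secs = ["Introduction", "Methods", "Results"]
    for f in window_features(sents, secs):
        print(sorted(f))
```

If the sentences arrive in shuffled order, the "prev" and "next" features describe arbitrary neighbours rather than the real context, which is exactly the information the classifier found useful.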

This has implications for the proposed Open Text Mining Interface (OTMI) where, as many people have complained in the comments, the location of sentences in the text is not preserved (which also makes the text almost useless for search indexing, as it breaks phrase queries).

One possibility would be to make the whole article available in the structured NLM Journal Publishing format, including all metadata, but with all stopwords, figures and tables removed. Although the paper mentioned above found that removing stopwords had an effect on the classification, such a document could at least be indexed and analysed. Presumably it would not be a useful substitute for the full article for people to read, so it should not affect journal subscriptions.
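As a rough sketch of what that reduction could look like, the following uses Python's standard xml.etree.ElementTree to drop figures and tables and to strip stopwords from the body text, leaving the front matter (metadata) untouched. Several things here are assumptions rather than part of any OTMI proposal: the function name reduce_article, the tiny stopword list, the choice to leave metadata text unaltered, and the assumption that figures and tables appear as fig and table-wrap elements without XML namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Tiny illustrative stopword list (an assumption); a real list would be longer.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "is", "as"}
STOPWORD_RE = re.compile(
    r"\b(?:" + "|".join(sorted(STOPWORDS)) + r")\b\s*", re.IGNORECASE
)

# Elements assumed to hold figures and tables in the NLM tag set.
DROP_TAGS = {"fig", "table-wrap"}

def reduce_article(xml_string):
    """Return the article XML without figures, tables or body stopwords."""
    root = ET.fromstring(xml_string)

    # Remove figures and tables, but keep any text that follows them.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in DROP_TAGS:
                continue
            idx = list(parent).index(child)
            if child.tail:
                if idx == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    prev = parent[idx - 1]
                    prev.tail = (prev.tail or "") + child.tail
            parent.remove(child)

    # Strip stopwords from the body only, so metadata stays intact.
    body = root.find("body")
    if body is not None:
        for el in body.iter():
            if el.text:
                el.text = STOPWORD_RE.sub("", el.text)
            if el.tail:
                el.tail = STOPWORD_RE.sub("", el.tail)

    return ET.tostring(root, encoding="unicode")
```

Because the element structure and word order are preserved, sections and sentences remain in place for indexing and classification, while the text itself becomes awkward to read.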