With great timing, a paper by Deendayal Dinakarpandian and colleagues, MachineProse: An Ontological Framework for Scientific Assertions, showed up in JAMIA today (a preprint is available). This paper is along the same lines as Leigh Dodd's analysis of the need for the raw data of scientific publications to be made available for reuse (also discussed briefly a while ago at Nodalpoint), and fits perfectly with my attempts to simplify the manual extraction of statements from scientific abstracts.

The authors propose that papers submitted for publication be accompanied by metadata describing the assertions made in the text, codified in a defined, machine-readable syntax. The MachineProse syntax is built around the Subject-Predicate-Object triplet and includes qualifiers that restrict the scope of an assertion, as well as a defined set of relationship concepts (the MachineProse Ontology).
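To make the idea concrete, here is a rough Python sketch of what a Subject-Predicate-Object assertion with scope qualifiers might look like. This is not the actual MachineProse syntax or ontology: the `Assertion` class, the `ALLOWED_PREDICATES` set, and the protein names are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an ontology of relationship concepts.
# These predicates are illustrative, not taken from MachineProse.
ALLOWED_PREDICATES = {"inhibits", "activates", "binds"}


@dataclass(frozen=True)
class Assertion:
    """One Subject-Predicate-Object triplet plus optional scope qualifiers."""
    subject: str
    predicate: str
    obj: str
    qualifiers: tuple = ()  # (name, value) pairs restricting the scope

    def __post_init__(self):
        # Reject relationships that are not part of the agreed vocabulary.
        if self.predicate not in ALLOWED_PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate!r}")

    def to_text(self):
        """Render the assertion as a single readable line."""
        core = f"{self.subject} {self.predicate} {self.obj}"
        if not self.qualifiers:
            return core
        scope = ", ".join(f"{name}={value}" for name, value in self.qualifiers)
        return f"{core} [{scope}]"


# Made-up example assertion, scoped to one organism.
example = Assertion("ProteinA", "inhibits", "ProteinB", (("organism", "mouse"),))
print(example.to_text())
```

Restricting predicates to a fixed vocabulary is what would let a program compare assertions across papers, while the qualifiers keep a claim from being read more broadly than the authors intended.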

[O]nce found, it takes a fair amount of effort and time to extract the required information from a paper. Ideally, if one agreed upon a formal model of representing information, machines (computer programs) could aid in the process of keeping scientists and professionals up to date. Given a biomedical paper, let us focus on the question: “How has our knowledge of the world changed after publication of this article?” The answer to the question may be distilled into the scientific assertion(s) that the paper makes. There is usually a plethora of information in a paper but most of this serves merely to justify the assertions, and set them in context. These details are generally of peripheral importance. Similarly, at the receiving end, the assertions are what the reader registers and carries away, even after reading the full paper. At the moment, abstracts serve the role of summarizing papers. Some journals require abstracts to be organized into sections, but this is still not machine-readable as unrestricted prose is used.

On the other hand, if the conclusions of a paper were summarized in a machine-readable formal structure of assertions that is not inordinately onerous, this would greatly aid both the submission and dissemination of cutting-edge scientific information.

The introduction to the paper also provides an excellent overview of text-mining approaches to extracting knowledge from scientific texts, which the authors suggest could be used to work through the backlog of papers that have already been published.