A Modular System for Automatic Entity Extraction and Manual Annotation of Academic Papers

At the recent "Beyond The PDF" conference in San Diego (which was pleasantly easy to attend remotely, because anyone could follow the webcast and participate via Twitter), several sessions discussed entity extraction and manual annotation of academic papers. This seems like a good time to write about the annotation system and user interface that I worked on at Nature Publishing Group over the last couple of years.

I'd been trying out annotation systems for a while, and others had been working on similar things: the RSC's Project Prospect launched in 2007 and was being presented at conferences, while Phil Bourne presented several authoring tools at Data Webs, the first conference I attended after joining NPG. This project started out as a discussion and early prototype with Robert Hoffman, who had recently launched WikiGenes and had a biological entity extractor in use on iHOP; the initial aim was to allow authors to annotate genes and proteins mentioned in their papers. The focus of the project quickly switched to chemistry, though, as the launch of Nature Chemistry was getting close. From then on, the work was carried out in close collaboration with the Nature Chemistry team, particularly technical editor Laura Croft, who provided ideas, most of the feature requests and user interface requirements for this system.

The annotation workflow expanded last year to cover more journals and now includes annotation of both chemical entities (compound names/molecular formulae) and biological entities (gene/protein names).

The curation interface, showing an annotation highlighted for editing and search results across chemistry-related databases.

Curating a set of gene/protein annotations, with search results across biology-related databases.

The system comprises five main parts:

Input: When an article XML file is uploaded, a specified list of elements (title, abstract, body, tables and figures) is converted to HTML using an XSL template. A configuration file specifies which of the original elements are blocks (which become divs in the HTML) and which are inline elements (which become spans). Some elements, such as links, get special treatment, and all element names are carried over to the class names of the HTML elements for styling. All named character entities are converted to UTF-8 characters, and no characters are added to or removed from these elements, so the character positions in the HTML and the original XML are identical.

Entity extraction: The content is passed through several automatic entity extractors, each of which is specialised for a particular type of entity (chemical names, gene names, place names, etc.). As most entity extractors prefer to parse plain text rather than HTML, the content is converted to text, and separator characters are added to prevent annotation across the boundaries of block-level elements.

When the results of the entity extraction are returned from each extraction web service, the annotations are converted to a standard format, which is then stored in MongoDB. By accounting for the previously added separator characters, the position of each annotation can be correctly translated back to a position in the HTML/XML.

Curation: The HTML is displayed in a web browser as several identical overlaid layers: one base layer containing no annotations, one layer for each set of automatically extracted annotations and one layer for each set of manually curated annotations. The display of each layer can be toggled on or off by the curator, allowing several sets of annotations to be displayed concurrently without overlapping elements breaking the DOM.

Each set of annotations is loaded on demand, as JSON, so that the initial rendering of the page is fast. The text is transparent on all layers except the base layer, so it doesn't cause anti-aliasing artifacts, and the CSS pointer-events property is used to pass all clicks through to the base layer; highlighting a passage of text thus creates an annotation only in the base layer. Annotations in each layer are represented as inline spans: these have visible text and colours to show their state, and can receive clicks regardless of which layer they're in. The z-index of each layer determines which layer's annotations receive clicks in preference to those of other layers; the manual sets of annotations are in the foreground.

Each annotation can have a single entity attached to it: a data object with a set of metadata properties appropriate for the type of entity/annotation being curated. Once an entity is chosen from the search results (see below) and attached, the annotation is copied out of the set of automatic annotations and into one of the manual sets: these are the annotations that will be published.

Search: Creating a new annotation, or clicking on an existing annotation (which selects all annotations of the same text in the current document), launches a search across several databases, chosen according to the type of annotation being curated (chemistry, biology, etc.). The curator can choose which of the known properties of the currently attached entity to search on: the default is to run a search on the "title" property using the text of the annotation. The results from each search source are converted to a standard format (currently HTML with pseudo-RDFa markup rendered server-side, although this could easily be JSON rendered client-side into a template), and the search results are displayed. When the curator selects one of the search results (from any of the search sources), the entity represented by that search result is attached to all of the annotations currently being edited, replacing any entity already attached. A list in the sidebar keeps track of all the attached entities in each annotation set.

Export: The positions of the curated annotations are spliced back into the article XML, which then re-enters the publishing workflow. The annotations themselves, including the entities attached to each annotation, are stored in an XML database for retrieval when the article is rendered as HTML, where they are matched back up to the annotation positions inline in the article XML; this storage also allows the annotations to be published independently via an OpenSearch/SRU gateway.
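
The offset bookkeeping in the entity extraction step can be sketched roughly as follows. This is only an illustration, not the production code: it assumes the block-level elements have already been reduced to a list of text strings, it assumes a newline as the separator character, and the function names (flatten_blocks, to_document_offset) are hypothetical.

```python
from bisect import bisect_right

SEPARATOR = "\n"  # assumed separator; the real system's choice is not stated


def flatten_blocks(blocks):
    """Join block texts with separators; return the text and each block's start offset in it."""
    starts = []
    position = 0
    for text in blocks:
        starts.append(position)
        position += len(text) + len(SEPARATOR)
    return SEPARATOR.join(blocks), starts


def to_document_offset(plain_offset, starts):
    """Translate an offset in the flattened text back to an offset without separators."""
    # The index of the containing block equals the number of separators before the offset.
    block_index = bisect_right(starts, plain_offset) - 1
    return plain_offset - block_index * len(SEPARATOR)


blocks = ["Aspirin inhibits COX-1.", "Caffeine is a stimulant."]
plain, starts = flatten_blocks(blocks)

# Pretend an extractor reported "Caffeine" at this position in the plain text.
start = plain.index("Caffeine")
end = start + len("Caffeine")

original = "".join(blocks)  # the content as it was before separators were added
doc_start = to_document_offset(start, starts)
doc_end = to_document_offset(end, starts)
print(original[doc_start:doc_end])  # prints "Caffeine"
```

Because an extractor treats the separator as a boundary, no annotation should straddle two blocks, so subtracting the number of preceding separators is enough to recover the original position.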

This project is ongoing: there is much work to be done on streamlining the user interface and adding more features for chemistry and biology curation.

From a technical point of view, there are several things that could be improved, including using a system like Backbone.js to separate the data model from the DOM (making it easier to synchronise changes between the annotation data, the front-end display and the server-side storage). It might also turn out to be better to store annotation positions relative to each node, giving each node a unique ID based on the XPath for that node (as PLoS use for their public annotations system), rather than the more fragile system used here, which counts the distance of each annotation node in characters from the start of the document.
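
To make the difference between the two addressing schemes concrete, here is a small sketch that converts a document-wide character offset into a block path plus an offset within that block. It is not PLoS's scheme or the code used here: the markup, the simple positional path and the function name to_node_relative are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted article: block elements are divs, inline elements are spans.
html = (
    "<article>"
    "<div class='title'>Aspirin and <span class='i'>COX-1</span></div>"
    "<div class='abstract'>Caffeine is a stimulant.</div>"
    "</article>"
)


def to_node_relative(root, offset):
    """Convert a document-wide character offset to (path of block, offset within block)."""
    consumed = 0
    for index, block in enumerate(root.findall("div"), start=1):
        length = len("".join(block.itertext()))
        if offset < consumed + length:
            return f"/{root.tag}/div[{index}]", offset - consumed
        consumed += length
    raise ValueError("offset is beyond the end of the document")


root = ET.fromstring(html)
full_text = "".join(root.itertext())

doc_offset = full_text.index("Caffeine")  # counted from the start of the document
path, local_offset = to_node_relative(root, doc_offset)
print(doc_offset, path, local_offset)
```

With node-relative positions, an edit to the title changes the document-wide offset of "Caffeine" but leaves its position within the second block untouched, which is why the scheme is less fragile.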

The key benefit of this system is that it's straightforward to plug in more automated entity extractors as they become available: by standardising the input and output formats, we can make use of as many entity extractors and search sources as possible. The automated annotations are mostly used as hints to the human curators, though, so being able to store the corrections that the curators make and feed those back to the entity extractors will be a big improvement; not many automatic annotation services are set up to learn from manual feedback yet.
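
As a rough sketch of what "standardising the output format" can look like, the following registers extractors behind a common function signature and has each one convert its own result format into a shared record. The Annotation fields, the registry and the toy extractor are all hypothetical; a real extractor would call a web service rather than search for two hard-coded names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    """Assumed standard format: character span, matched text, entity type and source."""
    start: int
    end: int
    text: str
    kind: str
    source: str


# Registry mapping an extractor name to a function: plain text -> list of Annotation.
EXTRACTORS: Dict[str, Callable[[str], List[Annotation]]] = {}


def register(name):
    """Decorator that adds an extractor function to the registry under the given name."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("toy-chemistry")
def toy_chemistry(text):
    """Stand-in for a call to a real extraction web service."""
    # Pretend the service answers in its own format: (offset, length) tuples.
    raw = [(text.find(name), len(name)) for name in ("aspirin", "caffeine") if name in text]
    # Convert that format into the shared Annotation record.
    return [
        Annotation(start, start + length, text[start:start + length], "chemical", "toy-chemistry")
        for start, length in raw
    ]


def extract_all(text):
    """Run every registered extractor, keeping each result set separate."""
    return {name: extractor(text) for name, extractor in EXTRACTORS.items()}


results = extract_all("aspirin and caffeine")
for annotation in results["toy-chemistry"]:
    print(annotation)
```

Adding another extractor then means writing one more adapter function; nothing downstream of extract_all needs to change.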

As more search sources are added to the system, the similarities between this and Paolo Ciccarese's Semantic/SWAN Annotation Framework become more and more obvious (Paolo's work inspired the use of annotation sets here, for example). In the SAF, each entity is selected from a set of ontologies rather than from a set of databases, but the process is otherwise quite similar.

We're storing the properties of each entity in MarkLogic as XML, using simple key/value pairs with CURIEs as the keys. When it comes to publishing these annotations, I hope that they can be expressed using both the Annotation Ontology and OpenAnnotation ontologies, which have similar aims in standardising the representation and publication of annotations.
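
For illustration only, a flat key/value representation with CURIE keys might be serialised along these lines. The element and attribute names (entity, property, key) and the CURIE prefixes are assumptions, not the schema actually stored in MarkLogic.

```python
import xml.etree.ElementTree as ET


def entity_to_xml(properties):
    """Serialise a dict of CURIE-keyed properties to a flat XML fragment."""
    entity = ET.Element("entity")  # hypothetical element name
    for key, value in properties.items():
        prop = ET.SubElement(entity, "property", {"key": key})
        prop.text = value
    return ET.tostring(entity, encoding="unicode")


def entity_from_xml(fragment):
    """Read the fragment back into a dict of CURIE-keyed properties."""
    return {prop.get("key"): prop.text for prop in ET.fromstring(fragment).iter("property")}


# Hypothetical entity: the keys are CURIEs (prefix:reference), the values plain strings.
caffeine = {
    "dc:title": "caffeine",
    "ex:formula": "C8H10N4O2",
}

fragment = entity_to_xml(caffeine)
print(fragment)
print(entity_from_xml(fragment) == caffeine)
```

Keeping the keys as CURIEs means each property can later be expanded to a full URI when the annotations are published in an RDF-based vocabulary.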

While I'd like to be able to open-source the code for anyone to use, it's probably going to remain locked up. There are good alternatives, though: some excellent work on an annotation system at the Open Knowledge Foundation (built for annotating in the Open Shakespeare project); the automatic markup of entities in PubMed Central UK (using the modular text-mining system Whatizit, developed and maintained by Dietrich Rebholz's group at the EBI); the Semantic Annotation Framework mentioned above (being applied, in collaboration with Elsevier, to a purpose similar to that of our tool: curating the results of text-mining services); OntoText's Linked Life Data platform (not specifically about annotation, but lots of text mining and linked data); and many others.