How To Text Mine Open Access Documents

Fetching the documents

First of all, find a set of open access documents in a standard XML format. Articles deposited in PubMed Central (PMC) are ideal, as they are converted from publisher-specific DTDs to one of the standard NLM Journal Article DTDs during deposition. PMC also has an OAI interface, which makes it straightforward to find and retrieve articles.

To find the name of a set of articles, use the OAI "ListSets" command to fetch all the sets into a local CSV file. Have a look through that file and find the set you're interested in - in this case I'm using "elsevierwt": Elsevier's "Sponsored Documents", for which a fee has been paid on publication to make the articles open access; the license allows text mining for non-commercial purposes*.
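As a rough sketch of that step, the following Python parses an OAI-PMH "ListSets" response and writes the set names to CSV. The function names and the `base_url` parameter are my own; take the actual OAI base URL from PMC's current documentation rather than from this example.

```python
import csv
import io
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL such as <base>?verb=ListSets."""
    query = urllib.parse.urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


def fetch(url):
    """Fetch a URL and return the response body as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_sets(xml_text):
    """Return a list of (setSpec, setName) tuples from a ListSets response."""
    root = ET.fromstring(xml_text)
    rows = []
    for set_el in root.iterfind(".//oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        rows.append((spec, name))
    return rows


def sets_to_csv(rows):
    """Serialise the (setSpec, setName) rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["setSpec", "setName"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Typical use would be `sets_to_csv(parse_sets(fetch(oai_url(base_url, "ListSets"))))`, written out to a file. A long list may be split over several responses linked by a resumption token, which this sketch does not follow.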

Use that set name with the OAI "ListIdentifiers" command to fetch the identifiers for all documents in that set into a local CSV file. The script for this step also checks that each article is in the "pmc-open" set, which denotes the Open Access subset of PubMed Central.
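One way to do the "pmc-open" check is to look at the setSpec elements in each record header of the "ListIdentifiers" response. That is an assumption about how the check could be done, not a description of the original script, and `parse_identifiers` is a name I've made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_identifiers(xml_text, required_set="pmc-open"):
    """Return (identifiers, resumption_token) for one page of ListIdentifiers.

    Only records whose header also lists `required_set` are kept;
    records flagged as deleted are skipped.
    """
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue
        set_specs = {el.text for el in header.iterfind("oai:setSpec", OAI_NS)}
        if required_set in set_specs:
            identifiers.append(
                header.findtext("oai:identifier", namespaces=OAI_NS)
            )
    # An absent or empty resumptionToken means this was the last page.
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return identifiers, (token or None)
```

To walk the whole set, request the first page with the `set` and `metadataPrefix` arguments, then keep requesting with `resumptionToken` alone until the function returns `None` for the token.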

For each identifier, use the OAI "GetRecord" command to fetch the document XML into a local folder. The document identifier can be base64-encoded into the filename, so it's easy to identify later.
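The filename encoding can be sketched in a few lines of Python. I'm assuming the URL-safe base64 alphabet here, since standard base64 can produce "/" characters, which are awkward in filenames; the function names and the ".xml" extension are illustrative.

```python
import base64


def identifier_to_filename(identifier, extension=".xml"):
    """Encode an OAI identifier as a filesystem-safe filename."""
    encoded = base64.urlsafe_b64encode(identifier.encode("utf-8"))
    return encoded.decode("ascii") + extension


def filename_to_identifier(filename, extension=".xml"):
    """Recover the original OAI identifier from an encoded filename."""
    if extension and filename.endswith(extension):
        filename = filename[: -len(extension)]
    return base64.urlsafe_b64decode(filename.encode("ascii")).decode("utf-8")
```

Because the encoding is reversible, every later step can recover the identifier from the filename alone, without keeping a separate lookup table.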

Converting the documents

Convert all the XML files to the most up-to-date NLM Journal Article DTD, using the XSL transformation provided by the NLM for this purpose. In this case, I'm converting from v2 to v3 of the NLM Journal Article Archiving and Interchange format; once JATS becomes the official standard, hopefully equivalent conversion tools will be provided.
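Python's standard library has no XSLT processor, so this sketch of the batch conversion shells out to the `xsltproc` command-line tool, which has to be installed separately. The stylesheet path is whatever you saved the NLM-provided XSL file as; the function names and folder layout are my own assumptions.

```python
import subprocess
from pathlib import Path


def xslt_command(stylesheet, source, target):
    """Build the xsltproc command that transforms one file."""
    return ["xsltproc", "--output", str(target), str(stylesheet), str(source)]


def convert_folder(stylesheet, source_dir, target_dir, run=subprocess.run):
    """Transform every .xml file in source_dir, writing into target_dir.

    `run` is injectable so the loop can be exercised without xsltproc.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for source in sorted(Path(source_dir).glob("*.xml")):
        target = target_dir / source.name
        # check=True raises CalledProcessError if the transformation fails.
        run(xslt_command(stylesheet, source, target), check=True)
        converted.append(target)
    return converted
```

Writing the output to a separate folder keeps the fetched originals untouched, so the conversion can be re-run if the stylesheet changes.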

Convert the article metadata from the XML into RDF triples in Turtle, and store them in a Kasabi data set:

find . -name '*.ttl' -exec curl -vvv -H "Content-Type: text/turtle" --data-binary @{} "${STORE}/store?apikey=${APIKEY}" \;
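For the metadata conversion itself, here is a minimal sketch that pulls the title and DOI out of the article XML and writes them as Turtle. It assumes un-namespaced NLM elements (`article-meta`, `article-title`, `article-id`), and the choice of Dublin Core terms, the "doi:" prefix and the subject URI are my own illustration, not the vocabulary used in the original data set.

```python
import xml.etree.ElementTree as ET


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def metadata_to_turtle(article_xml, subject_uri):
    """Return Turtle describing the article's title and DOI."""
    root = ET.fromstring(article_xml)
    meta = root.find(".//article-meta")
    if meta is None:
        raise ValueError("no <article-meta> element found")
    pairs = []
    title_el = meta.find(".//article-title")
    if title_el is not None:
        # Flatten inline markup in the title and normalise whitespace.
        title = " ".join("".join(title_el.itertext()).split())
        pairs.append(("dcterms:title", turtle_literal(title)))
    doi = meta.findtext("article-id[@pub-id-type='doi']")
    if doi:
        pairs.append(("dcterms:identifier", turtle_literal("doi:" + doi.strip())))
    if not pairs:
        return ""
    body = " ;\n    ".join(f"{pred} {obj}" for pred, obj in pairs)
    prefix = "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
    return f"{prefix}<{subject_uri}> {body} .\n"
```

Each article's output would be saved as a ".ttl" file, ready to be posted to the store.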

Finally, convert the body of the article to simple HTML, using another XSL transformation. All the inline elements will become "span" elements, all the block-level elements will become "div" elements**.
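I do this with XSL, but the same mapping can be illustrated in Python with ElementTree. This is a stand-in for the stylesheet, not a copy of it: the list of block-level elements is a guess at a reasonable subset, and using the original element name as the "class" value is my assumption.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of NLM elements treated as block-level.
BLOCK_ELEMENTS = {
    "body", "sec", "title", "p", "fig", "caption",
    "table-wrap", "list", "list-item", "disp-quote",
}


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def to_simple_html(element):
    """Recursively map an XML element to a div or span element.

    Attributes are copied across, and the original element name is
    stored in "class" (replacing any existing "class" attribute).
    """
    name = local_name(element.tag)
    html = ET.Element("div" if name in BLOCK_ELEMENTS else "span",
                      dict(element.attrib))
    html.set("class", name)
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(to_simple_html(child))
    return html
```

Calling `ET.tostring(to_simple_html(body), encoding="unicode")` on the article's `body` element then gives the HTML string.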

Text mining

Now the articles are ready for text mining. Choose an entity extraction tool or web service and run each article through it. I'm using the EBI's Whatizit here, which has a SOAP web service that understands plain text and returns XML. If you're lucky, you'll have a simple HTTP POST web service that understands HTML and returns JSON.
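If the service only accepts plain text, as in my case, the simple HTML has to be flattened first. Here is a minimal sketch using Python's built-in HTML parser; it assumes the div/span markup described above and treats each closing "div" as a line break. The class and function names are my own.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, starting a new line after each div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div":
            self.parts.append("\n")


def html_to_text(html):
    """Return the text of a simple HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Keeping one block per line means the offsets of any entities the service returns can still be traced back to a paragraph.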

Store the results locally, and extract the data you need into RDF triples as Turtle. So far, I've extracted disease and protein names from these articles using Whatizit; the easiest way to find the names for the Whatizit processing pipelines is to View Source and look at the options in Whatizit's HTML entry form.

Post the Turtle files to the same Kasabi data set as the article metadata, where they can be browsed and queried using SPARQL:

find . -name '*.ttl' -exec curl -vvv -H "Content-Type: text/turtle" --data-binary @{} "${STORE}/store?apikey=${APIKEY}" \;

* This particular license is quite vague, full of restrictions, and doesn't mention what you can do with derivative works, such as the results of text mining. You might want to choose a set of articles from PLoS or BioMed Central instead, which are clearly licensed under Creative Commons CC-BY licenses.

** Each element retains its attributes, and a "class" attribute is added for styling if you ever want to display this HTML.