Collecting article metrics with OpenRefine

Stephanie Haustein and colleagues recently described the lack of correlation between tweets about an article (using Altmetric data from July 2011 - December 2012) and formal citations of the article. They investigated 1.4 million articles indexed in both PubMed and Web of Science.

I decided to look at the data for smaller sets of articles, published in specific journals. After writing a few scripts to fetch and parse data to CSV from various web services, using the DOI as the key for each row, I realised that it would be easier to gather the data in OpenRefine by incrementally adding columns.

Start OpenRefine

Launch the OpenRefine application and open the web interface.

Import a CSV file

Create a new project

Enter the URL of some CSV data (e.g. the output of a Solr query), or choose a local CSV file (e.g. the result of a SPARQL query).

Check the settings (UTF-8 encoding, comma separator, ignore blank rows) > create project

Fetch JSON data from a URL

"doi" column > Edit column > Add column by fetching URLs ("altmetric_json") > GREL:

'http://api.altmetric.com/v1/doi/' + value.escape('url')

Parse JSON data to a new column

"altmetric_json" column > Edit column > Add column based on this column ("tweets") > GREL:

value.parseJson().cited_by_tweeters_count

Fetch XML data from a URL

"doi" column > Edit column > Add column by fetching URLs ("crossref_xml") > GREL:

'http://www.crossref.org/openurl/?noredirect=true&pid=' + 'YOUR_CROSSREF_PID'.escape('url') + '&id=doi:' + value.escape('url')

Parse XML data to a new column

"crossref_xml" column > Edit column > Add column based on this column ("citations") > GREL:

value.parseHtml().select('query')[0].htmlAttr('fl_count').toNumber()

Merge data from another project

Import a CSV file, with columns "doi" and "citations", to a new project named "citations_scopus".

"doi" column of original project > Edit column > Add column based on this column > GREL:

cell.cross("citations_scopus", "doi").cells["citations"].value[0]

Scatterplot facet

"tweets" column > Facet > Scatterplot facet

Export to CSV

Export > Custom Tabular Exporter > unselect JSON and XML columns > Download

Import to Google Spreadsheets

One limitation of the data used here is that the dates of each tweet and citation are not known; it might be interesting to correlate tweets and citations during specific windows of time after article publication.

Note: I've realised that there's missing data for up to half of the articles, so it would be incorrect to draw any conclusions yet.