HTML metadata for journal articles


You’d think it would be easy to pin down an ontology for journal articles. There are basically just these properties:

  • title
  • authors[]
  • datePublished
  • abstract

But… some of those are shared with more generic classes higher up the tree, so abstract becomes description, title becomes name, author becomes creator. Each author can be a string or an object. Each author has one or more affiliations, which have addresses. The authors are in a specific order, and some of them have certain roles. There are several different dates: creation, review, update, publication.

  • author: { name: { displayName, familyName, initials, middleName, lastName }, role, affiliation }
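The "string or an object" problem can be sketched as a small normalisation step. This is only an illustration: the class names, the normalize_author helper and the input dict shape (keys "name", "role", "affiliation") are my own assumptions, not any existing schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Name:
    display_name: str
    family_name: Optional[str] = None
    initials: Optional[str] = None


@dataclass
class Affiliation:
    name: str
    address: Optional[str] = None


@dataclass
class Author:
    name: Name
    role: Optional[str] = None
    affiliations: List[Affiliation] = field(default_factory=list)


def normalize_author(author: Union[str, dict]) -> Author:
    """Accept a plain string or a dict (hypothetical shape) and return an Author."""
    if isinstance(author, str):
        return Author(name=Name(display_name=author))

    name = author["name"]
    if isinstance(name, str):
        name = {"displayName": name}

    affiliations = []
    for item in author.get("affiliation", []):
        if isinstance(item, str):
            affiliations.append(Affiliation(name=item))
        else:
            affiliations.append(Affiliation(name=item["name"], address=item.get("address")))

    return Author(
        name=Name(
            display_name=name["displayName"],
            family_name=name.get("familyName"),
            initials=name.get("initials"),
        ),
        role=author.get("role"),
        affiliations=affiliations,
    )


def normalize_authors(authors: list) -> List[Author]:
    """List position preserves the author order, which is significant."""
    return [normalize_author(a) for a in authors]
```

Keeping the authors in a list, rather than a set or a dict, is what carries the ordering.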

Then the big one - each article (a "Work") is expressed in various different forms ("Instances", or publication events). It might be published in one or more collections/periodicals. And it isn't just published in a “journal”, but on a certain page of an issue, which is part of a volume, which is part of a periodical, which has an ISSN (and an eISSN, and an ISSN-L) and a title (in multiple forms of abbreviation):

  • journalName
  • issn
  • issue
  • volume

The work itself can have identifiers (DOI, PMID, arXiv, etc.) which may or may not be URLs and may or may not be applicable to each publication event. The work may also be rendered in different formats (HTML, PDF) and languages, each of which has its own URL and metadata.

  • htmlURL
  • pdfURL
  • DOI
  • PMID
  • arxiv
  • language
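Since identifiers "may or may not be URLs", one way to cope is to turn each into a URL on demand. A minimal sketch, assuming the resolver URL patterns below (doi.org, pubmed.ncbi.nlm.nih.gov, arxiv.org/abs) are the ones you want; the identifier_url helper and its scheme names are hypothetical.

```python
# Assumed resolver patterns; swap in whichever resolvers you prefer.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "arxiv": "https://arxiv.org/abs/{}",
}


def identifier_url(scheme: str, value: str) -> str:
    """Return a URL for an identifier, leaving values that are already URLs alone."""
    if value.startswith(("http://", "https://")):
        return value
    template = RESOLVERS.get(scheme.lower())
    if template is None:
        raise KeyError(f"no resolver known for scheme {scheme!r}")
    return template.format(value)


# Placeholder DOI; the PMID is the example article mentioned in this post.
print(identifier_url("DOI", "10.1234/example"))
print(identifier_url("PMID", "23180662"))
```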

Thus, while schema.org has a ScholarlyArticle class (and even a MedicalScholarlyArticle class), it’s quite incomplete and doesn’t even cover a lot of the citation_* tags that Google Scholar indexes.
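For comparison, here is roughly what emitting the citation_* tags looks like. It's a sketch: the tag names are the ones I understand Google Scholar to read, the input keys reuse the property names from the lists above plus startPage/endPage, and citation_meta_tags is a made-up helper name.

```python
from html import escape


def citation_meta_tags(article: dict) -> str:
    """Render one <meta> tag per value, repeating citation_author per author."""
    pairs = [("citation_title", article["title"])]
    pairs += [("citation_author", author) for author in article["authors"]]

    # (meta tag name, key in the article dict); all of these are optional.
    optional = [
        ("citation_publication_date", "datePublished"),
        ("citation_journal_title", "journalName"),
        ("citation_issn", "issn"),
        ("citation_volume", "volume"),
        ("citation_issue", "issue"),
        ("citation_firstpage", "startPage"),
        ("citation_lastpage", "endPage"),
        ("citation_pdf_url", "pdfURL"),
    ]
    for tag, key in optional:
        if article.get(key):
            pairs.append((tag, article[key]))

    return "\n".join(
        f'<meta name="{name}" content="{escape(str(value), quote=True)}">'
        for name, value in pairs
    )
```

Note that this is the flat key/value style: journal, volume and issue all hang directly off the article.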

There’s a W3C community group - “Schema Bib Extend” - trying to extend the schema.org schema for bibliographic markup (along with similar efforts in other groups for comics and other serials/periodicals).

OCLC have made their own extension to schema.org to add classes for things like Periodicals and Newspapers.

Freebase has an extensive set of types and properties around Scholarly Works and Citations, including Journal Article.

BIBO is an existing bibliographic ontology, which is similar to Zotero’s field definitions, and there are similar attempts in FaBiO, BIRO and BibJSON.

There's the MODS XML schema, which I rather like. MODS has proven itself as an intermediary format in bibutils, and converts quite cleanly to JSON.

The newest entrant is BIBFRAME from the Library of Congress: yet to release an ontology, but with a clear overview defining Work, Instance, Authority and Annotation superclasses, where a Work is published as one or more Instances.

The nested/graph approach is pleasing, theoretically: Article (Work) -> hasInstance -> Article (PDF) -> isPartOf -> Issue -> isPartOf -> Volume -> isPartOf -> Journal -> hasISSN -> ISSN. On the other hand, schema.org is looking for simple key/value pairs attached to an object, and practically it seems to work OK (in Zotero and Mendeley, at least) that the article has “issn”, “startPage”, “endPage”, “volume”, “issue” etc. attached to it rather than to one or more associated “isPartOf” entities.
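The two shapes aren't that far apart: walking the isPartOf chain and copying a few fields gets you from the graph to the flat record. A rough sketch, in which the node layout ("type", "number", "isPartOf") and the FIELD_MAP table are assumptions for illustration, and the journal name and ISSN are placeholders.

```python
# For each node type: which of its keys to copy, and under what flat name.
FIELD_MAP = {
    "Article": {"name": "title", "startPage": "startPage", "endPage": "endPage"},
    "Issue": {"number": "issue"},
    "Volume": {"number": "volume"},
    "Journal": {"name": "journalName", "issn": "issn"},
}


def flatten(node: dict) -> dict:
    """Follow isPartOf links upwards, collecting fields into one flat dict."""
    flat = {}
    while node is not None:
        mapping = FIELD_MAP.get(node.get("type"), {})
        for source_key, flat_key in mapping.items():
            if source_key in node:
                # The innermost value wins if two levels supply the same key.
                flat.setdefault(flat_key, node[source_key])
        node = node.get("isPartOf")
    return flat


article = {
    "type": "Article",
    "name": "An example article",
    "startPage": "101",
    "endPage": "110",
    "isPartOf": {
        "type": "Issue",
        "number": "3",
        "isPartOf": {
            "type": "Volume",
            "number": "12",
            "isPartOf": {"type": "Journal", "name": "Journal of Examples", "issn": "0000-0000"},
        },
    },
}

print(flatten(article))
```

Going the other way, from flat to nested, is where it gets awkward: a flat record can only describe one publication event.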

When you come to add markup to HTML to describe these properties, there are several ways to do it. (It will be great when articles are just published as HTML with metadata embedded, rather than having to generate XML in multiple formats for archiving and submission to various systems.) Links to alternate formats fit nicely as rel=alternate links, while either HTML5 microdata or RDFa Lite can add key-value properties to an object. The two are essentially equivalent, except that RDFa Lite has simpler attribute names while microdata has a defined DOM API and a redundant “itemscope” attribute.
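To show how close the two syntaxes are, here is a small sketch that renders the same article either way. The attribute names (itemscope/itemtype/itemprop versus vocab/typeof/property) come from the microdata and RDFa Lite specs; the render_article and render_alternate_link helpers, the input dict shape and the choice of HTML elements are just assumptions for the example.

```python
from html import escape


def render_article(article: dict, syntax: str = "microdata") -> str:
    """Render basic schema.org properties as microdata or RDFa Lite."""
    if syntax == "microdata":
        root = 'itemscope itemtype="http://schema.org/ScholarlyArticle"'
        prop = "itemprop"
    elif syntax == "rdfa":
        root = 'vocab="http://schema.org/" typeof="ScholarlyArticle"'
        prop = "property"
    else:
        raise ValueError(f"unknown syntax: {syntax!r}")

    date = escape(article["datePublished"], quote=True)
    lines = [f"<article {root}>"]
    lines.append(f'  <h1 {prop}="name">{escape(article["title"])}</h1>')
    for author in article["authors"]:
        lines.append(f'  <span {prop}="author">{escape(author)}</span>')
    lines.append(f'  <time {prop}="datePublished" datetime="{date}">{date}</time>')
    lines.append(f'  <p {prop}="description">{escape(article["abstract"])}</p>')
    lines.append("</article>")
    return "\n".join(lines)


def render_alternate_link(url: str, media_type: str) -> str:
    """Point at another rendering of the same work, e.g. the PDF."""
    return (
        f'<link rel="alternate" type="{escape(media_type, quote=True)}" '
        f'href="{escape(url, quote=True)}">'
    )
```

Only the root attributes and the name of the property attribute change between the two branches; everything else is identical.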

The main aim of adding this markup, currently, is so that when someone bookmarks/shares the page (either privately or in public), the information that’s displayed/saved to their collection is easily readable and correct. On the other side of things, having a standard set of metadata that can be passed between various services is also useful, when you want to use the same metadata about an article to look it up in multiple services.

A search engine like Google Scholar probably only really needs a few fields to identify and describe a Work: title, authors, publication date, abstract/description and URL. For locating, filtering, or referring to a specific instance of a work, though, the other fields become useful.

I’ve added the basic schema.org microdata and RDFa Lite to an HTML rendering of PubMed articles at http://pubmed.macropus.org/{pmid}, e.g. http://pubmed.macropus.org/23180662. If you have an application that allows people to bookmark/share PubMed articles, that might be a good URL to use.