A Resourceful Alternative to OAI-PMH

As institutions built repositories of metadata for their digital objects, they needed a standard interface that anyone could use to harvest that metadata, so that records from many repositories could be combined into one central index. OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) was specified for this purpose in 2001-2.

OAI-PMH is best described as "somewhat archaic". Since it was specified, the Atom Publishing Protocol has defined a better way to work with collections and resources, JSON has largely replaced XML for transporting metadata, and many frameworks provide a more standard RESTful interface to data. Still, hundreds of repositories support OAI-PMH, as it is a stable and basically effective standard.

OAI-PMH is based on calling methods ("verbs") over HTTP (ListSets, ListRecords, GetRecord, etc.) and parsing namespaced sections out of the response XML. The responses have a schema, but there's no schema for the requests and no WSDL file, so an OAI-PMH service can't be accessed using SOAP (though this was discussed).

Here's how OAI-PMH approaches the basic things you might want to do with an interface to a repository's metadata, along with possible alternatives built on collections of resources ("/sets", "/records"), content negotiation for response formats (XML, JSON), and self-describing responses and URLs.

List all the records in the repository

OAI-PMH: Fetch a list of all records in the repository, using the query "verb=ListRecords&metadataPrefix={metadataPrefix}".

If there are more than a particular (unspecified) number of items in the list, the list is split into several pages. There's generally no count of how many pages there are (there's a completeListSize value, but it's optional), so you keep requesting pages until no resumption token is returned.
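
The loop described above can be sketched in Python as follows. This is a minimal sketch rather than a complete harvester: the `harvest_records` name and the `fetch` callable (anything that takes a URL and returns the response XML as a string) are assumptions made for illustration, and OAI-PMH error responses are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_records(fetch, base_url, metadata_prefix="oai_dc"):
    """Yield every <record> element, following resumption tokens.

    `fetch` is a hypothetical callable: URL in, response XML (str) out.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iter(OAI_NS + "record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        # The last page has no resumptionToken element, or an empty one.
        if token is None or not (token.text or "").strip():
            return
        # A resumption token replaces every argument except the verb.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Because the total number of pages is unknown, the generator simply stops when a page arrives without a usable token.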

The metadataPrefix is a bit unnecessary, as all repositories have to provide metadata in "oai_dc" format (a fixed subset of the standard Dublin Core elements), which is enough to provide basic metadata for all records. This (or, ideally, the standard Dublin Core set of elements) should be the only metadata format provided by OAI-PMH. Records in any other formats (NLM DTD, etc) should be available independently at separate URLs, specified using link[rel=alternate].

It would also be good to use Link headers in the response, so that the document doesn't need to be parsed in order to find the URL of the next page of results (this way, gzipped content can be fetched and stored directly without intermediate decompression).
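
As a rough illustration of why Link headers help, the following sketch finds the "next" URL from header values alone, without touching the response body. The `parse_links` function is hypothetical, and the parsing is deliberately simplified: it assumes parameter values contain no commas or semicolons.

```python
import re

# Matches "<url>" followed by any number of ";name=value" parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def parse_links(header_values):
    """Return a dict mapping each rel value to its URL."""
    links = {}
    for value in header_values:
        for url, params in LINK_PATTERN.findall(value):
            for param in params.split(";"):
                name, _, val = param.strip().partition("=")
                if name.lower() == "rel":
                    # A rel parameter may list several space-separated types.
                    for rel in val.strip('"').split():
                        links[rel] = url
    return links
```

A harvester could write the still-compressed body straight to disk and use only `parse_links(...).get("next")` to decide what to fetch next.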

Alternative: "GET /records" to return all records in a repository. Optional: query parameters for pagination and filtering by set:

GET /records
Accept: application/json
Accept-Encoding: gzip

HTTP/1.1 200 OK
Link: <http://example.com/records>; rel=self
Link: <http://example.com/records?before=1234>; rel=next
Content-Type: application/json
Content-Encoding: gzip
{
	"$self": "http://example.com/records",
	"$next": "http://example.com/records?before=1234",
	"total": 10000,
	"items": [
		{
			"$self": "http://example.com/records/example-record-a",
			"id": "oai:example:example-record-a",
			"ordinal": "1235",
			"dc": {
				"title": "Example Record A",
				"date": "2012-05-01"
			}
		},
		{
			"$self": "http://example.com/records/example-record-b",
			"id": "oai:example:example-record-b",
			"ordinal": "1234",
			"dc": {
				"title": "Example Record B",
				"date": "2012-05-01"
			}
		}
	]
}
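
A client for a response shaped like the one above could be as small as the following sketch. The function names are hypothetical, and it assumes (the example does not say so) that the last page simply omits "$next".

```python
import json
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode the JSON body, asking for application/json."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_records(url, fetch=fetch_json):
    """Yield every item in a paged collection by following "$next" links."""
    while url:
        page = fetch(url)
        yield from page.get("items", [])
        # Assumption: the final page has no "$next" property.
        url = page.get("$next")
```

For example, `for record in iter_records("http://example.com/records"): ...` would walk the whole collection one page at a time.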

List all the sets in the repository

OAI-PMH: Fetch a list of all sets in the repository, using the query "verb=ListSets".

The list of sets can be split over several pages.

You get a list of identifiers and names for sets this way, but no counts of how many items are in each set - for that you need to fetch all the records in a set, and count them yourself.

While record identifiers have a specified format, set identifiers are just simple strings.

Alternative: "GET /sets" to return all sets in a repository:

GET /sets
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
{
	"$self": "http://example.com/sets",
	"$next": "http://example.com/sets?before=1234",
	"total": 1000,
	"items": [
		{
			"$self": "http://example.com/set/journal-of-cell-biology",
			"id": "oai:example:set-journal-of-cell-biology",
			"ordinal": "1235",
			"title": "Journal of Cell Biology",
			"$records": "http://example.com/records?set=oai:example:journal-of-cell-biology"
		},
		{
			"$self": "http://example.com/set/journal-of-biological-chemistry",
			"id": "oai:example:set-journal-of-biological-chemistry",
			"ordinal": "1234",
			"title": "Journal of Biological Chemistry",
			"$records": "http://example.com/records?set=oai:example:journal-of-biological-chemistry"
		}
	]
}

List all the records in a set

OAI-PMH: Fetch a list of all records in a set, using the query "verb=ListRecords&set={setSpec}&metadataPrefix={metadataPrefix}".

Alternative: "GET /records?set={setID}":

GET /records?set=oai:example:journal-of-cell-biology
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
{
	"$self": "http://example.com/records?set=oai:example:journal-of-cell-biology",
	"$next": "http://example.com/records?set=oai:example:journal-of-cell-biology&before=1234",
	"total": 10000,
	"items": [
		{
			"$self": "http://example.com/records/example-record-a",
			"id": "oai:example:example-record-a",
			"ordinal": "1235",
			"dc": {
				"title": "Example Record A",
				"date": "2012-05-01"
			}
		},
		{
			"$self": "http://example.com/records/example-record-b",
			"id": "oai:example:example-record-b",
			"ordinal": "1234",
			"dc": {
				"title": "Example Record B",
				"date": "2012-05-01"
			}
		}
	]
}

Fetch information for a single set

OAI-PMH: Not possible; there is no verb for fetching a single set.

Alternative: "GET /set/{id}":

GET /set/journal-of-cell-biology
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
{
	"$self": "http://example.com/set/journal-of-cell-biology",
	"id": "oai:example:set-journal-of-cell-biology",
	"title": "Journal of Cell Biology",
	"$records": "http://example.com/records?set=oai:example:journal-of-cell-biology"
}

Fetch information for a single record

OAI-PMH: Fetch a record using the query "verb=GetRecord", with an "identifier" parameter that identifies the record using a non-HTTP OAI identifier (and a "metadataPrefix" parameter, as for ListRecords).

The non-HTTP identifier is reasonable, as it may be necessary to identify a record uniquely even if the OAI or host repository URL changes, to avoid duplicates.

Alternative: "GET /records/{id}":

GET /records/example-record-a
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
{
	"$self": "http://example.com/records/example-record-a",
	"id": "oai:example:example-record-a",
	"dc": {
		"title": "Example Record A",
		"date": "2012-05-01"
	}
}

Find records that have been deleted from a repository

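OAI-PMH: Support for deleted records is optional. A repository declares in its Identify response whether it keeps track of deletions ("no", "transient" or "persistent"), and only repositories that do will mark deleted records with a status="deleted" header in list responses. For the rest, there's no mention of deleted records when fetching a list: you have to use the ListIdentifiers verb to get the current list of identifiers for records in a set, and remove any records from your local copy that are not in that list.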
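
The comparison against ListIdentifiers described above amounts to a set difference between the identifiers held locally and those currently listed. A minimal sketch, assuming the local copy is a dict keyed by OAI identifier (the `prune_deleted` name is hypothetical):

```python
def prune_deleted(local_records, current_identifiers):
    """Remove local records whose identifier is no longer listed.

    Returns the sorted list of identifiers that were removed.
    """
    current = set(current_identifiers)
    deleted = sorted(identifier for identifier in local_records
                     if identifier not in current)
    for identifier in deleted:
        del local_records[identifier]
    return deleted
```

Note that this requires fetching the complete identifier list on every sync, however few records have actually changed.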

Alternative: Add a "status=deleted" property to records in the repository that have been deleted, rather than removing them entirely. If the metadata for the item is removed, these can be simple "tombstone" records:

GET /records/example-record-a
Accept: application/json

HTTP/1.1 410 Gone
Content-Type: application/json
{
	"$self": "http://example.com/records/example-record-a",
	"id": "oai:example:example-record-a",
	"ordinal": "1236",
	"status": "deleted"
}
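
With tombstones, a harvester can keep its local copy in sync from the record listing alone. The sketch below assumes that tombstone records also appear in "/records" listings (which the "ordinal" on the tombstone suggests but the example does not state), and that the local copy is a dict keyed by "id"; the `apply_page` name is hypothetical.

```python
def apply_page(local_records, items):
    """Apply one page of items to a local copy keyed by record id."""
    for item in items:
        if item.get("status") == "deleted":
            # Tombstone: drop the record if it is held locally.
            local_records.pop(item["id"], None)
        else:
            local_records[item["id"]] = item
    return local_records
```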

Browse the list of records in a web browser

OAI-PMH: Some repositories provide an XSL file which web browsers can use to render the response as HTML.

Alternative: Return the response as HTML, with semantic elements marked up using RDFa Lite.

GET /records?set=oai:example:journal-of-cell-biology
Accept: text/html

HTTP/1.1 200 OK
Content-Type: text/html
<!DOCTYPE html>
<html>
<head>
  <link rel="self" href="http://example.com/records?set=oai:example:journal-of-cell-biology">
  <link rel="next" href="http://example.com/records?set=oai:example:journal-of-cell-biology&before=1234">
  <meta name="total" content="10000">
</head>
<body>
  <ol class="items">
    <li>
      <table vocab="http://schema.org/" typeof="CreativeWork"
                resource="oai:example:example-record-a">
        <tr>
          <th>title</th>
          <td property="dc:title"><a href="http://example.com/records/example-record-a">Example Record A</a></td>
        </tr>
        <tr>
          <th>date</th>
          <td property="dc:date">2012-05-01</td>
        </tr>
      </table>
    </li>
    <li>
      <table vocab="http://schema.org/" typeof="CreativeWork" 
                resource="oai:example:example-record-b">
        <tr>
          <th>title</th>
          <td property="dc:title"><a href="http://example.com/records/example-record-b">Example Record B</a></td>
        </tr>
        <tr>
          <th>date</th>
          <td property="dc:date">2012-05-01</td>
        </tr>
      </table>
    </li>
  </ol>
</body>
</html>