Working with the Harvard Library Bibliographic Dataset

Earlier this week, Harvard Library released a set of metadata for 12 million items held by the library, under a Creative Commons CC0 (Public Domain) license.

Here's what I've done with it so far:

Start an Ubuntu server instance on Amazon EC2. The "medium" instance type works best, or "high-CPU medium" (dual-core) if you have a script that can run in parallel.

Create a 10GB volume, attach it to the instance and mount it at /marc:

sudo mkfs.xfs /dev/xvdf
sudo mkdir /marc
sudo mount /dev/xvdf /marc
sudo chmod 777 /marc

I also mounted an empty 50GB volume at /mods, for storing individual files transformed to a different format (see below).

The dataset is available as a gzipped tarball containing 10 MARC21-formatted files. Download to a temporary directory using aria2:

aria2c --max-connection-per-server=5

or wget:

wget --continue

Once the file has downloaded, extract the MARC21 files:

tar -xvzf harvard.tar.gz -C /marc

I've shared an EBS snapshot of this stage (snap-a099a1dd in us-east-1), so you can start from this point by creating a volume from it and attaching it to a running EC2 instance.
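A quick sanity check at this stage is to count the records in each extracted file. The sketch below is one way to do that in Python, not part of the original workflow. It relies on two features of the MARC21 format: the first five bytes of every record give its total length as decimal digits, and every record ends with the 0x1D terminator byte. The function names are my own, and the sample bytes at the bottom are synthetic, not real Harvard data.

```python
import io
from typing import BinaryIO, Iterator

RECORD_TERMINATOR = b"\x1d"  # MARC21 end-of-record byte


def iter_marc_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each raw MARC21 record (as bytes) from a binary stream."""
    while True:
        length_field = stream.read(5)
        if not length_field:
            return  # clean end of file
        if len(length_field) < 5 or not length_field.isdigit():
            raise ValueError(f"bad record length field: {length_field!r}")
        length = int(length_field)
        if length <= 5:
            raise ValueError(f"implausible record length: {length}")
        record = length_field + stream.read(length - 5)
        if len(record) != length or not record.endswith(RECORD_TERMINATOR):
            raise ValueError("truncated or misaligned record")
        yield record


def count_records(path: str) -> int:
    """Count the records in one MARC21 file on disk."""
    with open(path, "rb") as handle:
        return sum(1 for _ in iter_marc_records(handle))


# Synthetic two-record stream, only to show the call shape.
sample = b"00008ab\x1d" + b"00009abc\x1d"
print(len(list(iter_marc_records(io.BytesIO(sample)))))  # prints 2
```

For the real data, call count_records() on each file under /marc. If a file's declared lengths turn out not to match its contents, the ValueError will flag it rather than silently miscounting.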

Run a PHP script that opens each MARC21 file, converts each record to MARCXML using File_MARC, then transforms each record to MODS using a stylesheet provided by the Library of Congress. I've also shared an EBS snapshot of this stage (snap-90333fed in us-east-1), as it took a while to run. The script happened to leave the MODS namespace off the output XML, which may or may not make it easier to work with…
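Because the MODS files from this stage have no namespace, code written against them will not match standard MODS (which uses the http://www.loc.gov/mods/v3 namespace), and vice versa. One way around that, sketched here in Python rather than the PHP used above, is to strip namespaces on load so the same element paths work on both. The helper names and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

MODS_NS = "http://www.loc.gov/mods/v3"  # the standard MODS namespace URI


def strip_namespaces(root: ET.Element) -> ET.Element:
    """Remove any '{uri}' prefix from element tags, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


def mods_titles(xml_text: str) -> List[str]:
    """Return every titleInfo/title in a MODS record, namespaced or not."""
    root = strip_namespaces(ET.fromstring(xml_text))
    return [t.text.strip() for t in root.findall(".//titleInfo/title") if t.text]


# Made-up records: one as this stage writes them, one as standard MODS.
plain = "<mods><titleInfo><title>Moby Dick</title></titleInfo></mods>"
namespaced = (
    f'<mods xmlns="{MODS_NS}">'
    "<titleInfo><title>Moby Dick</title></titleInfo></mods>"
)

print(mods_titles(plain))
print(mods_titles(namespaced))
```

Both calls return the same list, so downstream code does not need to know which flavour of file it was given.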

I've begun one more step: an XSL transformation from MODS to CloudSearch input XML, the idea being to import all the data into a CloudSearch instance and make it browsable and searchable there. There are still some fields to add, though. Other transformations that might be useful include Turtle, for import into a triplestore like Kasabi; HTML, for browsable/crawlable individual records; and JSON, for loading into ElasticSearch, MongoDB or a JS interface.
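As a rough idea of what the JSON option could look like, here is a minimal Python sketch that flattens one namespace-less MODS record into a JSON document. It is not the XSL transformation described above. The MODS paths (titleInfo/title, name/namePart, subject/topic, originInfo/dateIssued) are standard MODS elements, but the choice of fields, the JSON key names, and the sample record are all my own assumptions.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def _texts(root: ET.Element, path: str) -> List[str]:
    """Collect the non-empty text of every element matching path."""
    return [e.text.strip() for e in root.findall(path) if e.text and e.text.strip()]


def mods_to_document(xml_text: str) -> Dict[str, Any]:
    """Flatten a namespace-less <mods> record into a plain dict.

    Paths are relative to the root, so titles nested inside
    relatedItem elements are deliberately left out.
    """
    root = ET.fromstring(xml_text)
    return {
        "title": _texts(root, "titleInfo/title"),
        "names": _texts(root, "name/namePart"),
        "subjects": _texts(root, "subject/topic"),
        "date_issued": _texts(root, "originInfo/dateIssued"),
    }


# Made-up record, shaped like the namespace-less MODS output.
sample_mods = """
<mods>
  <titleInfo><title>Moby Dick</title></titleInfo>
  <name><namePart>Melville, Herman</namePart></name>
  <subject><topic>Whaling</topic></subject>
  <originInfo><dateIssued>1851</dateIssued></originInfo>
</mods>
"""

document = mods_to_document(sample_mods)
print(json.dumps(document, ensure_ascii=False))
```

Writing one such JSON object per line gives a file that is easy to split and feed into a bulk loader. Every value is kept as a list because MODS allows most of these elements to repeat.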