Managing Metadata for Academic PDFs

Note: I recommend that you make a backup of your PDFs before trying any of this, just in case. The XMP writing is particularly experimental.

This tar.gz contains:

  1. A PDF file of an academic paper that has its metadata written—in the form of a BibTeX string—into a) the Keywords field of the PDF Info Dictionary, b) the extended attributes (xattr) of the PDF file and c) the PDF file's XMP metadata stream.
  2. The Perl script used to identify the title of the paper, fetch the BibTeX metadata from HubMed and write it into the file.

The first of those metadata storage places, the Keywords field, is old-school and clutters the Keywords field with a BibTeX string, but it works across platforms. If you run this command in the Terminal (while BibDesk isn't running):

defaults write edu.ucsd.cs.mmccrack.bibdesk BDSKDefaultGroupFieldSeparatorKey ", "

then drag the PDF into BibDesk, select the entry and run this AppleScript from BibDesk's Scripts menu, it should read the BibTeX string, clean it up and update the entry accordingly.

The second storage place, the file's extended attributes, is not cross-platform and is lost when the file is transferred by email or zipped (it's stored in a hidden dot-file on Mac OS X and in a similar way on other UNIX-based systems). Thanks to Adam Maxwell, if you get the latest nightly build of BibDesk and run this command in the Terminal (while BibDesk isn't running):

defaults write edu.ucsd.cs.mmccrack.bibdesk BDSKShouldUsePDFMetadata 1

then you should be able to drag the PDF into BibDesk's main window and the metadata will automatically be imported from the extended attributes (and if you have AutoFiling turned on, the file will also be renamed and moved to your papers folder automatically).

The last storage place, the XMP metadata stream, is theoretically the ideal place for storing metadata in PDFs. Unfortunately there aren't many libraries for reading and writing XMP, so at the moment tools like BibDesk aren't able to use it.

You can read a file's extended attributes using an xattr script and (sometimes) read the Info Dictionary using mdls (on Mac OS X 10.4). For the XMP, try extracting it with a regular expression.
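
As a rough illustration of the regular-expression approach, here is a minimal Python sketch (not the Perl script from the tarball). It assumes the XMP packet is stored uncompressed in the PDF and wrapped in the usual `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, and that it is UTF-8 encoded. The function names `extract_xmp` and `extract_xmp_from_file` are made up for this example. If the metadata stream has been compressed, this will simply find nothing.

```python
import re
from typing import Optional

# Assumption: the XMP packet sits in the file as plain bytes, delimited by
# <?xpacket begin=...?> and <?xpacket end=...?> markers.
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def extract_xmp(pdf_bytes: bytes) -> Optional[str]:
    """Return the body of the first XMP packet in the raw bytes, or None."""
    match = XMP_PACKET.search(pdf_bytes)
    if match is None:
        return None
    # Assumption: UTF-8 packet; undecodable bytes are replaced, not fatal.
    return match.group(1).decode("utf-8", errors="replace").strip()


def extract_xmp_from_file(path: str) -> Optional[str]:
    """Read a PDF from disk (hypothetical helper) and extract its XMP packet."""
    with open(path, "rb") as handle:
        return extract_xmp(handle.read())
```

Calling `extract_xmp_from_file("paper.pdf")` (with a placeholder file name) should return the XML as a string, which you can then hand to any XML parser, or `None` if no packet was found.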

See James Howison and Abby Goodrum's paper "Why can't I manage academic papers like MP3s" for a good description of where all this came from.