Preserving PDF metadata

Michael Pascoe has a video on Bioscreencast that demonstrates how he gets PDFs into Papers from a web browser. It involves going through the Print... panel and choosing "Save PDF to Papers", which creates a brand-new PDF and then imports it into Papers.

The problem is that creating a new PDF removes all the metadata that was in the original document. Given the current state of publishers' metadata there probably wasn't much there anyway, and Papers has a nice feature for searching elsewhere (e.g. PubMed), then importing the metadata and matching it to the PDF.

Perhaps Papers could offer to import the original PDF (through a Services menu option?), but either way it would still be useful if Papers could look through the PDF, check for a DOI and use that to look up the basic metadata from CrossRef or PubMed (you can search for a DOI in PubMed and get the PMID, then use that to look up more metadata).
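The DOI-to-PMID lookup described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not how Papers does it: it assumes NCBI's E-utilities `esearch` and `efetch` endpoints and the `[AID]` (article identifier) search field for DOIs. The function names, the sample response and the PMID in it are made up for illustration, and no network request is made.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: NCBI E-utilities base URL.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(doi):
    """Build a PubMed search URL that looks a DOI up as an article identifier."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": doi + "[AID]"})
    return EUTILS + "/esearch.fcgi?" + query


def pmids_from_esearch(xml_text):
    """Pull the PMIDs out of an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./IdList/Id")]


def efetch_url(pmid):
    """Build the URL that fetches the full PubMed record for a PMID."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return EUTILS + "/efetch.fcgi?" + query


# Illustrative response only; the PMID is invented.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>1</Count>
  <IdList><Id>12345678</Id></IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(esearch_url("10.1000/xyz123"))
    for pmid in pmids_from_esearch(SAMPLE_RESPONSE):
        print(efetch_url(pmid))
```

Fetching the two URLs (with `urllib.request`, say) and parsing the `efetch` record is left out. The point is only that the chain DOI, then PMID, then full metadata takes two plain HTTP requests.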

The other step is getting the metadata back out of Papers and into a reference/bookmark manager like Connotea or Zotero. I guess synchronisation through APIs and a remote server is a good way to go for this, but it would be nice to have a local method (involving AppleScript or the Services menu again?). It would open a file containing the bibliographic metadata (RIS/BibTeX/MODS/RDF) for the selected item in your web browser, where it could be captured by your default handler application.
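As a rough illustration of the local method, here is a minimal sketch that writes one item's metadata to a RIS file. The record layout (a dict with `authors`, `title`, `journal`, `year` and `doi` keys) and the function names are hypothetical; Papers does not expose anything like this as far as I know.

```python
import tempfile


def to_ris(record):
    """Serialise a simple metadata dict as a RIS journal-article record."""
    lines = ["TY  - JOUR"]  # RIS tags are followed by two spaces, a hyphen, a space
    for author in record.get("authors", []):
        lines.append("AU  - " + author)
    for tag, key in (("TI", "title"), ("JO", "journal"), ("PY", "year"), ("DO", "doi")):
        value = record.get(key)
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines) + "\n"


def write_ris_file(record):
    """Write the record to a temporary .ris file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".ris", delete=False,
                                     encoding="utf-8") as handle:
        handle.write(to_ris(record))
        return handle.name


if __name__ == "__main__":
    # Made-up example values.
    example = {
        "authors": ["Doe, J.", "Roe, R."],
        "title": "An example title",
        "journal": "Example Journal",
        "year": 2007,
        "doi": "10.1000/xyz123",
    }
    print(to_ris(example))
```

The resulting path could then be handed to the browser, for example with `webbrowser.open(pathlib.Path(path).as_uri())`. Whether the browser passes it on to a handler such as Zotero depends on how the `.ris` file type is registered on your machine.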

Comments

Personally, in 10.5, a decent Spotlight plugin catering for academic PDFs would be great. On the "Papers" application, and I think I've mentioned one of them before, I have two unfair comments.

First, I've tried it, but for some reason a PDF manager that cannot *independently* handle citation management within an external text document is incomplete -- for me. Of course, one could just handle it with BibDesk integration, etc., but software abstractions are tedious.

Second, I launched the app: nice splash screen, some interesting ways of organizing PDFs, but I felt a little stifled by the naming convention option, because First *and* Last Author should be allowed within the title. After I settled on a compromise and launched, the application crashed. On relaunch the splash screen remained, some indexing was going on, and then I started to play with the application. I thought the naming convention could be changed with some more advanced preference management so that two authors could be added to the field, but that wasn't possible.

Then, after a few hours of fiddling, I just quit the thing. I guess my way of managing papers, as I have written (a little) about before [log.phile.eu/2006/06/academic-pdf], is ingrained, and a simple LaunchBar/Quicksilver/Spotlight search does the trick for me. I realize the Papers application is so much more, but I can decouple PDF reading and annotation with Skim and it doesn't feel so bad.

On the metadata front, which is on topic, I still believe the first thing one encounters in terms of "metadata" is the title of the document. It's 2007, and there has been enough advancement in web application tech to make the current situation quite annoying. All the bells and whistles being added to journal sites are fine, but they ignore some basic principles. Good metadata starts with the title; embed all the other information one wants (DOI, identifiers, etc.) but cap the byte limit.

Sorry for the rant.

Posted by: gummi on November 7, 2007 9:25 AM

There's definitely a lot to say about PDF and metadata. Tony Hammond wrote a great series about it as well: http://www.crossref.org/CrossTech/2007/08/metadata_in_pdf_1_strategies.html

The problem at the moment is simple: what metadata?
Elsevier is the only one currently putting at least the DOI in the PDF metadata; all other publishers (as far as I know) put zip, nada. Well, that's not quite true: their Acrobat nicely puts the guy who registered the program as the author. Nice! And hence we got tons of feedback from people annoyed that Papers seemed to add some random author names and titles upon import of PDFs. So now we have a checkbox to ignore (!) the PDF metadata (usually the filename is of even more interest than the title field in the metadata of the PDF), and ignoring the metadata is now the default in Papers. Sad but true.

I've explained to several publishers that they should add decent metadata to their PDFs. Some didn't even see the point until you tell them that not only Papers users might benefit, but in fact ALL Mac users benefit, TODAY. Spotlight indexes those fields, and adding author names, title etc. would instantly make your PDFs findable in Mac OS X 10.4 and 10.5 (>80% of all Mac users). The good news is that most publishers I spoke to now realize this and plan on adding metadata to their PDFs sooner or later.

Regarding Papers and importing PDFs, importing the originals is as easy as dropping them on the icon. One inconvenience is that once a PDF is opened in Safari we can't get access to the path of the (temporarily) downloaded file (e.g. through AppleScript), so the bookmarklet doesn't work, for instance, and neither would a service.

We certainly have AppleScriptability on the list for Papers, and with the export plugins it should be very easy to add a sync to fill-in-your-favourite-online-or-local-service/application function. We'd be happy to help anyone develop such plugins, and when we have a bit of time we'd love to look at it ourselves as well.

Finally, we are indeed looking into parsing RDF or bibliography metadata while you're browsing the web inside Papers, so that it's easier to pick up metadata along the way.

Cheers,
Alex
mek@mekentosj.com

Ps. @gummi thanks for the feedback, we'll see if we can add the option to add both the first and last author name, right now it's indeed either first or last.

Thanks Alex.

I think with a Firefox extension you should be able to get to the cached URL of the paper, so that could be worth a look for people that don't use Safari. Actually I open PDFs directly in Skim, and you're right, it's easy (and easy to forget) to just drag the icon onto Papers.

Just one thing: could you use pdftotext on the PDF and then use a regular expression to look for the DOI?

>>could you use pdftotext on the PDF and then use a regular expression to look for the DOI?

Actually, we do that already. PDFKit allows you to get the text from a PDF as well; it's not flawless but good enough for searching for the DOI. If you click on Match in Papers it will already check if it can find a DOI and if so use it to search the repository. We will extend this further by allowing some kind of auto-matching upon import in a future version of Papers.
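For anyone who wants to try the text-plus-regex idea outside Papers, here is a minimal sketch using the `pdftotext` command-line tool rather than PDFKit. It assumes `pdftotext` is installed and on your PATH. The DOI pattern is a common heuristic, not an official grammar, and the function names are made up for illustration.

```python
import re
import subprocess

# Heuristic: "10.", a 4-9 digit registrant prefix, a slash, then a suffix
# running up to the next whitespace, quote or angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    # Assumption: a real DOI ending in one of these characters would be truncated.
    return match.group(0).rstrip(".,;:)]")


def pdf_text(path):
    """Extract plain text from a PDF; '-' tells pdftotext to write to stdout."""
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(find_doi("Example Journal 2007. doi:10.1000/xyz123. All rights reserved."))
```

On a real file you would call `find_doi(pdf_text("paper.pdf"))`. Like the PDFKit route, this depends on the extracted text being clean: a DOI broken across lines or rendered with odd ligatures will be missed.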

Forgot to add: not sure if you have seen this already, but in the new Papers 1.5 you can set Skim (or any other PDF viewer) as your preferred PDF reader in the preferences. Instead of opening the PDF in a new tab it will then open it in Skim. Papers will even refresh the PDF after you bring it to the foreground again so your annotations appear immediately. They make a great duo that way.

"Actually, we do that already." - fantastic. And the preferred application as well.
