The State of Biomedical PDFs

Introduction

The publication of scientific results in text form, accompanied by a minimal amount of interpreted data, is a historical aberration caused by having to use print as a distribution medium. Now that distribution is occurring online, results can be published in full with all the accompanying images, structured data and methods for reproduction. At the moment, however, we're still stuck halfway: clicking through poorly designed websites and ending up with folders full of metadata-less PDFs designed for print.

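I surveyed the PDFs available from various publishers, using a selection of papers that I'd recently accessed and that were published in the last 6 months (i.e. they represent the cutting edge of each publisher's technology). For each paper I looked at the items that form the column headings of the table below: how access is authenticated, the PDF link on the page, the download method, the filename, the Info Dictionary, security, whether the DOI appears in the PDF, whether the PDF has bookmarked sections, and the markup of the PDF link.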

Results

Very Good Things are marked in green; Very Bad Things are marked in red.
| Publisher | Journal | PMID | Authentication | PDF link | Download method | Filename | Info Dictionary | Security | DOI in PDF | Bookmark sections in PDF | PDF link markup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wiley | European Journal of Immunology | 16453386 | IP address or Athens login | PDF (86k) | framed PDF | fulltext | correct article title | no | yes | no | no |
| Nature Publishing Group | Laboratory Investigation | 16446705 | IP address (via proxy) | Download PDF (with PDF icon) | direct download | 3700389a.pdf | none | no | yes | no | no |
| Highwire | Journal of Immunology | 16301683 | IP address (no alternative) | Full Text (PDF) | framed, delayed PDF | 7728.pdf | none | no | no | no | yes [4] |
| Elsevier (ScienceDirect) | Biol Blood Marrow Transplant | 16399597 | IP address or Athens login | PDF (422 K) | direct download | science | DOI in title field | no | yes | no | no |
| Highwire | Journal of Experimental Medicine | 16380508 | IP address (no alternative) | PDF (Full Text) | framed, delayed PDF | 119.pdf | "untitled" in title field | no | no | no | yes [4] |
| Karger | Chem Immunol Allergy | 16354957 | username (couldn't access) | Article (PDF 139 KB) | | | | | | | |
| Elsevier (ScienceDirect) | J Allergy Clin Immunol | 16354957 | IP address or Athens login | PDF (151 K) | direct download | science [1] | DOI in title field | no | yes | yes | no |
| Oxford Journals | J Ntl Cancer Inst | 16333031 | IP address (no alternative) | Full Text (PDF) | direct, delayed download | 1760.pdf | title="dji401.indd", author="elampa1r" | no | yes | no | no |
| Nature Publishing Group | Nature Immunology | 16311599 | IP address (via proxy) | Download PDF (with PDF icon) | direct download | ni1289.pdf | title="npgrj_NI_1289.83..92" | yes: no copying or extraction (password protection) [2] | yes | no | no |
| Blackwell Synergy | Oral Microbiol Immunol | 16238600 | IP address or Athens login | Image link at the bottom of the page: PDF [401KB] | direct download in popup window | j.1399-302x.2005.00241.x | title="omi_241 382..386" | no | no | no | no |
| Wolters Kluwer Health | Journal of Immunotherapy | 16224273 | IP address or Athens login | Image as input button: Full Text(PDF) 134K | framed PDF | 00002371-200511000-00006 | keywords="560" | no | no | no | no |
| AAAS | Science | 16123302 | IP address (via proxy) | Full Text (PDF) | direct download | 1380.pdf | title="1377 1380..1384" | no | no [3] | no | no |
| BioMed Central | BMC Immunology | 16179091 | none [5] | PDF (3,650KB) | direct download | 1471-2172-6-23.pdf | title="1471-2172-6-23.fm", author="csproduction" | no | yes | yes [6] | no |

Notes:

[1] Filename is PIIS009167490501941X.pdf when accessed through the journal's site rather than through ScienceDirect.
[2] The protection on PDFs from Nature journals prevents anyone from copying content (whether for fair use or not) and converting the PDF to text. It's nominally password protection, but there's no actual password, so anyone with Adobe Acrobat can remove the protection. I don't know of any journals other than Nature titles which do this.
[3] The PDF is taken straight from the print version, so contains the start and end of adjacent papers.
[4] In the HEAD of the HTML page, there are meta tags which include Dublin Core metadata, plus a <meta name="citation_pdf_url"> tag which contains the URL for the PDF.
[5] All of BioMed Central's papers are open access, so can be freely accessed from anywhere.
[6] Extensive use of both major and minor section headings.
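
As a rough illustration of why the markup in note [4] matters, here is a minimal Python sketch of how a tool might pull the PDF URL out of a page that has already been fetched. The class and function names (PdfUrlFinder, find_pdf_url) and the sample HTML are hypothetical, and the sketch assumes the URL sits in the tag's content attribute, as is usual for meta tags.

```python
from html.parser import HTMLParser


class PdfUrlFinder(HTMLParser):
    """Collect the content of <meta name="citation_pdf_url"> tags."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Assumption: the PDF URL is carried in the "content" attribute.
        if attributes.get("name") == "citation_pdf_url" and attributes.get("content"):
            self.pdf_urls.append(attributes["content"])


def find_pdf_url(html):
    """Return the first citation_pdf_url found in the page, or None."""
    finder = PdfUrlFinder()
    finder.feed(html)
    return finder.pdf_urls[0] if finder.pdf_urls else None


# Hypothetical page head, for illustration only.
sample = """<html><head>
<meta name="citation_pdf_url" content="http://example.org/content/1/2/3.pdf">
</head><body></body></html>"""

print(find_pdf_url(sample))
```

On pages without the tag, which in this survey was every publisher except Highwire, find_pdf_url returns None and a tool would have to fall back on scraping the visible link text.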

Conclusions

While most of the larger publishers provided an acceptable method of authentication, the PDF files they produce are clearly not optimised for ease of use by the reader.

It's almost impossible to build a tool to automatically fetch PDFs for papers (to attach to a bibliographic library in EndNote or BibDesk, say), because most publishers provide no machine-readable link to the PDF file; of those surveyed, only the Highwire journals did. <link rel="alternate" type="application/pdf" href="http://path/to/the/pdf"/> would be ideal for this use.

Once the PDFs are downloaded, having a folder full of files named "science", "science(2)", "science(3)", etc. is no use at all (especially as they have no file extension). Several publishers use page numbers as filenames, but even that's not very helpful: something like "first author-year-journal name-volume-page number.pdf" would be much better.

Having got the PDFs into some kind of order, there's then no useful metadata attached to any of them (apart from those published by Wiley, commendably, but even those still only have the title). There's space in the Info Dictionary for Title, Author, Subject and Keywords, and that's without even beginning to use XMP.

Finally, there's the placement of the DOI, which should really be in the metadata but needs at least to be in the PDF text; Nature's bizarre copy protection; and the bookmark sections (Introduction, Methods, Results, Discussion, etc.), which are rarely present but would also be useful, especially for searching within particular sections.
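
The naming scheme suggested above ("first author-year-journal name-volume-page number.pdf") is simple to generate once the citation details are known. The following is a minimal Python sketch, not any publisher's actual method: the function name pdf_filename, its parameters and the example values are all hypothetical, and the choice of which characters to strip is an assumption.

```python
import re


def pdf_filename(first_author, year, journal, volume, first_page):
    """Build "first author-year-journal name-volume-page number.pdf"."""
    parts = [first_author, str(year), journal, str(volume), str(first_page)]
    cleaned = []
    for part in parts:
        # Replace characters that are awkward in filenames, and the hyphen
        # used as the field separator, with a space.
        part = re.sub(r'[\\/:*?"<>|-]+', " ", part)
        # Collapse runs of whitespace and trim the ends.
        cleaned.append(" ".join(part.split()))
    return "-".join(cleaned) + ".pdf"


# Made-up citation details, for illustration only.
print(pdf_filename("Smith", 2006, "Example Journal", 12, 345))
```

Replacing hyphens inside a field keeps the hyphen unambiguous as the separator, so the name can later be split back into its five parts.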

The implementation of all of these features could be automated with little change to the publishers' systems, but would be a major benefit to researchers struggling to deal with large amounts of literature.