The State of Biomedical PDFs

Introduction

The publication of scientific results in text form, accompanied by a minimal amount of interpreted data, is a historical aberration caused by having to use print as a distribution medium. Now that distribution is occurring online, results can be published in full with all the accompanying images, structured data and methods for reproduction. At the moment, however, we're still stuck halfway: clicking through poorly designed websites and ending up with folders full of metadata-less PDFs designed for print.

I decided to do a survey of the PDFs available from various publishers, using a selection of papers that I'd recently accessed and that were published in the last 6 months (i.e. they're the cutting edge of the publisher's technology). I looked at:

The authentication method used to access the full text article when not at an institutional computer (mostly by IP address, sometimes using a proxy; often authenticating via Athens; occasionally unable to access at all: the failure rate was actually much higher for small publishers, but I didn't include them all).
The method of linking to the PDF file.
How the PDF was displayed (PDFs in frames are annoying, as they get shrunk down to fit in the space; so is delayed downloading).
The filename of the downloaded PDF (looking for information that would be useful in identifying the paper later on).
The metadata added to the PDF's Info Dictionary (viewed using 'Document Properties' in Adobe Reader).
Whether there were any restrictions on use of the PDF content.
Whether the PDF contained a DOI so that the paper could easily be located online.
Labelled sections in the PDF, which show up in the Bookmarks pane in Adobe Reader, allowing quick navigation.
Markup in the HTML page which would allow a machine to automatically identify and download the PDF file.

Results

Very Good Things are marked in green; Very Bad Things are marked in red.

Publisher	Journal	PMID	Authentication	PDF link	Download method	Filename	Info Dictionary	Security	DOI in PDF	Bookmark sections in PDF	PDF link markup
Wiley	European Journal of Immunology	16453386	IP address or Athens login	PDF (86k)	framed PDF	fulltext	correct article title	no	yes	no	no
Nature Publishing Group	Laboratory Investigation	16446705	IP address (via proxy)	Download PDF (with PDF icon)	direct download	3700389a.pdf	none	no	yes	no	no
Highwire	Journal of Immunology	16301683	IP address (no alternative)	Full Text (PDF)	framed, delayed PDF	7728.pdf	none	no	no	no	yes [4]
Elsevier (ScienceDirect)	Biol Blood Marrow Transplant	16399597	IP address or Athens login	PDF (422 K)	direct download	science	DOI in title field	no	yes	no	no
Highwire	Journal of Experimental Medicine	16380508	IP address (no alternative)	PDF (Full Text)	framed, delayed PDF	119.pdf	"untitled" in title field	no	no	no	yes [4]
Karger	Chem Immunol Allergy	16354957	username (couldn't access)	Article (PDF 139 KB)
Elsevier (ScienceDirect)	J Allergy Clin Immunol	16354957	IP address or Athens login	PDF (151 K)	direct download	science [1]	DOI in title field	no	yes	yes	no
Oxford Journals	J Natl Cancer Inst	16333031	IP address (no alternative)	Full Text (PDF)	direct, delayed download	1760.pdf	title="dji401.indd", author="elampa1r"	no	yes	no	no
Nature Publishing Group	Nature Immunology	16311599	IP address (via proxy)	Download PDF (with PDF icon)	direct download	ni1289.pdf	title="npgrj_NI_1289.83..92"	yes: no copying or extraction (password protection) [2]	yes	no	no
Blackwell Synergy	Oral Microbiol Immunol	16238600	IP address or Athens login	Image link at the bottom of the page: PDF [401KB]	direct download in popup window	j.1399-302x.2005.00241.x	title="omi_241 382..386"	no	no	no	no
Wolters Kluwer Health	Journal of Immunotherapy	16224273	IP address or Athens login	Image as input button: Full Text(PDF) 134K	framed PDF	00002371-200511000-00006	keywords="560"	no	no	no	no
AAAS	Science	16123302	IP address (via proxy)	Full Text (PDF)	direct download	1380.pdf	title="1377 1380..1384"	no	no [3]	no	no
BioMed Central	BMC Immunology	16179091	none [5]	PDF (3,650KB)	direct download	1471-2172-6-23.pdf	title="1471-2172-6-23.fm", author="csproduction"	no	yes	yes [6]	no

Notes:

[1] Filename is PIIS009167490501941X.pdf when accessed through the journal's site rather than through ScienceDirect.
[2] The protection on PDFs from Nature journals prevents anyone from copying content (whether for fair use or not) and converting the PDF to text. It's nominally password protection, but there's no actual password, so anyone with Adobe Acrobat can remove the protection. I don't know of any journals other than Nature titles which do this.
[3] The PDF is taken straight from the print version, so contains the start and end of adjacent papers.
[4] In the HEAD of the HTML page, there are meta tags which include Dublin Core metadata, plus a <meta name="citation_pdf_url"> tag which contains the URL for the PDF.
[5] All of BioMed Central's papers are open access, so can be freely accessed from anywhere.
[6] Extensive use of both major and minor section headings.
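
As a rough illustration of how a tool could use the markup described in note [4], here is a minimal Python sketch that pulls the PDF URL out of a page's meta tags. It assumes the URL sits in the tag's content attribute (the usual place for a meta tag's value, but an assumption here); the name find_pdf_url and the example.org URL are made up for the example.

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Remember the first <meta name="citation_pdf_url"> value seen."""

    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        # Only meta tags matter, and only the first match is kept.
        if tag != "meta" or self.pdf_url is not None:
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), hence the "or".
        if (attributes.get("name") or "").lower() == "citation_pdf_url":
            # Assumption: the URL is stored in the content attribute.
            self.pdf_url = attributes.get("content")


def find_pdf_url(html):
    """Return the PDF URL advertised in the page's meta tags, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.pdf_url


# Hypothetical page fragment; the URL is a placeholder, not a real paper.
page = (
    '<html><head>'
    '<meta name="citation_pdf_url" content="http://example.org/paper.pdf">'
    '</head><body></body></html>'
)
print(find_pdf_url(page))
```

A page without the tag simply gives None, which is the case for most of the publishers in the table.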

Conclusions

While most of the larger publishers provided an acceptable method of authentication, the PDF files they produce are obviously not optimised for ease of use by the reader. It's almost impossible to build a tool to automatically fetch PDFs for papers (to attach to a bibliographic library in EndNote or BibDesk, say), because outside the Highwire-hosted journals there are no machine-readable links to the PDF files. <link rel="alternate" type="application/pdf" href="http://path/to/the/pdf"/> would be ideal for this use.

Once the PDFs are downloaded, having a folder full of files named "science", "science(2)", "science(3)", etc, is no use at all (especially as they have no file extensions). Several publishers use page numbers as filenames, but even that's not very helpful: something like "first author-year-journal name-volume-page number.pdf" would be much better.

Having got the PDFs into some kind of order, there's then no metadata attached to any of them (apart from those published by Wiley, commendably, but even those still only have the title). There's space in the Info Dictionary for Title, Author, Subject and Keywords, and that's without even beginning to use XMP. Finally, there are three smaller issues: the placement of the DOI, which should really be in the metadata but needs at least to be in the PDF text; Nature's bizarre copy protection; and the bookmark sections (Introduction, Methods, Results, Discussion, etc), which are rarely present but would also be useful, especially for searching within particular sections.
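
The filename scheme suggested above is simple enough to sketch in a few lines of Python. This is only one possible reading of it: the function name paper_filename, the choice of which characters to strip, and the sample values are all assumptions made for the example, not anything a publisher actually uses.

```python
import re


def paper_filename(first_author, year, journal, volume, first_page):
    """Build 'first author-year-journal name-volume-page number.pdf'."""
    fields = (first_author, year, journal, volume, first_page)
    cleaned = []
    for field in fields:
        # Drop characters that are unsafe in filenames on common systems.
        text = re.sub(r'[\\/:*?"<>|]', "", str(field))
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
        cleaned.append(text)
    return "-".join(cleaned) + ".pdf"


# Hypothetical citation details, purely for illustration.
print(paper_filename("Example", 2005, "Journal of Examples", 12, 345))
# prints: Example-2005-Journal of Examples-12-345.pdf
```

Hyphens inside a field (a double-barrelled surname, say) are left alone, so the result is meant for human eyes rather than for parsing back into its parts.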

The implementation of all of these features could be automated with little change to the publishers' systems, and would be a major benefit to researchers struggling to deal with large amounts of literature.
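
To give a sense of how little is involved on the metadata side, here is a hedged Python sketch that writes out the four Info Dictionary fields mentioned above in PDF dictionary syntax. It only produces the text of the dictionary and does not embed it in a PDF file. The function names and sample values are invented for the example, and it assumes plain ASCII text (anything else needs a different string encoding, which isn't handled here).

```python
def pdf_literal_string(text):
    """Wrap text as a PDF literal string, escaping \\, ( and )."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"


def info_dictionary(title, author, subject, keywords):
    """Return the text of an Info Dictionary with the four standard fields."""
    entries = [
        ("Title", title),
        ("Author", author),
        ("Subject", subject),
        # Assumption: keywords are joined into one comma-separated string.
        ("Keywords", ", ".join(keywords)),
    ]
    body = " ".join(
        "/%s %s" % (key, pdf_literal_string(value)) for key, value in entries
    )
    return "<< " + body + " >>"


# Hypothetical article details, purely for illustration.
print(info_dictionary("A (short) title", "A. Author", "Immunology", ["T cells", "PDF"]))
```

A real production system would hand these values to whatever software generates the PDF, but the point stands that the data needed is already in the publisher's database.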