Introduction
The publication of scientific results in text form, accompanied by a minimal amount of interpreted data, is a historical aberration caused by having to use print as a distribution medium. Now that distribution is occurring online, results can be published in full with all the accompanying images, structured data and methods for reproduction. At the moment, however, we're still stuck halfway: clicking through poorly designed websites and ending up with folders full of metadata-less PDFs designed for print.

I decided to do a survey of the PDFs available from various publishers, using a selection of papers that I'd recently accessed and that were published in the last 6 months (i.e. they're the cutting edge of the publisher's technology). I looked at:
- The authentication method used to access the full text article: mostly by IP address (sometimes using a proxy when not at an institutional computer), often via an Athens login, and occasionally not accessible at all (the failure rate was actually much higher for small publishers, but I didn't include them all).
- The method of linking to the PDF file.
- How the PDF was displayed (PDFs in frames are annoying, as they get shrunk down to fit in the space; so is delayed downloading).
- The filename of the downloaded PDF (looking for information that would be useful in identifying the paper later on).
- The metadata added to the PDF's Info Dictionary (viewed using 'Document Properties' in Adobe Reader).
- Whether there were any restrictions on use of the PDF content.
- Whether the PDF contained a DOI so that the paper could easily be located online.
- Labelled sections in the PDF, which show up in the Bookmarks pane in Adobe Reader, allowing quick navigation.
- Markup in the HTML page which would allow a machine to automatically identify and download the PDF file.
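
As a rough illustration of the DOI check, here is a minimal Python sketch that looks for DOI-like strings in text already extracted from a PDF. The pattern is an assumption based on the usual "10.prefix/suffix" shape rather than a full DOI grammar, and the sample string and its DOI are made up.

```python
import re

# Assumed pattern: "10.", a 4-9 digit prefix, a slash, then a run of
# characters that are not whitespace, quotes or angle brackets.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return DOI-like strings found in text, trimming trailing punctuation."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


# Made-up sample text standing in for the first page of a PDF.
sample = "Eur. J. Immunol. 2006. 36: 1-10. DOI 10.1000/example.123."
print(find_dois(sample))
```

A check like this only works on PDFs that allow text extraction, so it would not get far with a copy-protected file.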
Results
Very Good Things are marked in green; Very Bad Things are marked in red.

Publisher | Journal | PMID | Authentication | PDF link | Download method | Filename | Info Dictionary | Security | DOI in PDF | Bookmark sections in PDF | PDF link markup |
---|---|---|---|---|---|---|---|---|---|---|---|
Wiley | European Journal of Immunology | 16453386 | IP address or Athens login | PDF (86k) | framed PDF | fulltext | correct article title | no | yes | no | no |
Nature Publishing Group | Laboratory Investigation | 16446705 | IP address (via proxy) | Download PDF (with PDF icon) | direct download | 3700389a.pdf | none | no | yes | no | no |
Highwire | Journal of Immunology | 16301683 | IP address (no alternative) | Full Text (PDF) | framed, delayed PDF | 7728.pdf | none | no | no | no | yes [4] |
Elsevier (ScienceDirect) | Biol Blood Marrow Transplant | 16399597 | IP address or Athens login | PDF (422 K) | direct download | science | DOI in title field | no | yes | no | no |
Highwire | Journal of Experimental Medicine | 16380508 | IP address (no alternative) | PDF (Full Text) | framed, delayed PDF | 119.pdf | "untitled" in title field | no | no | no | yes [4] |
Karger | Chem Immunol Allergy | 16354957 | username (couldn't access) | Article (PDF 139 KB) | |||||||
Elsevier (ScienceDirect) | J Allergy Clin Immunol | 16354957 | IP address or Athens login | PDF (151 K) | direct download | science [1] | DOI in title field | no | yes | yes | no |
Oxford Journals | J Natl Cancer Inst | 16333031 | IP address (no alternative) | Full Text (PDF) | direct, delayed download | 1760.pdf | title="dji401.indd", author="elampa1r" | no | yes | no | no |
Nature Publishing Group | Nature Immunology | 16311599 | IP address (via proxy) | Download PDF (with PDF icon) | direct download | ni1289.pdf | title="npgrj_NI_1289.83..92" | yes: no copying or extraction (password protection) [2] | yes | no | no |
Blackwell Synergy | Oral Microbiol Immunol | 16238600 | IP address or Athens login | Image link at the bottom of the page: PDF [401KB] | direct download in popup window | j.1399-302x.2005.00241.x | title="omi_241 382..386" | no | no | no | no |
Wolters Kluwer Health | Journal of Immunotherapy | 16224273 | IP address or Athens login | Image as input button: Full Text(PDF) 134K | framed PDF | 00002371-200511000-00006 | keywords="560" | no | no | no | no |
AAAS | Science | 16123302 | IP address (via proxy) | Full Text (PDF) | direct download | 1380.pdf | title="1377 1380..1384" | no | no [3] | no | no |
BioMed Central | BMC Immunology | 16179091 | none [5] | PDF (3,650KB) | direct download | 1471-2172-6-23.pdf | title="1471-2172-6-23.fm", author="csproduction" | no | yes | yes [6] | no |
Notes:
[1] Filename is PIIS009167490501941X.pdf when accessed through the journal's site rather than through ScienceDirect.
[2] The protection on PDFs from Nature journals prevents anyone from copying content (whether for fair use or not) and converting the PDF to text. It's nominally password protection, but there's no actual password, so anyone with Adobe Acrobat can remove the protection. I don't know of any journals other than Nature titles which do this.
[3] The PDF is taken straight from the print version, so contains the start and end of adjacent papers.
[4] In the HEAD of the HTML page, there are meta tags which include Dublin Core metadata, plus a <meta name="citation_pdf_url"> tag which contains the URL for the PDF.
[5] All of BioMed Central's papers are open access, so can be freely accessed from anywhere.
[6] Extensive use of both major and minor section headings.
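
To show why the markup in note [4] matters, here is a minimal Python sketch of how a tool could pick the PDF URL out of a page's HEAD using only the standard library. It assumes the URL sits in the meta tag's content attribute, which note [4] doesn't spell out; the class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class PdfMetaFinder(HTMLParser):
    """Collect URLs from <meta name="citation_pdf_url"> tags."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "citation_pdf_url":
            # Assumption: the PDF URL is held in the content attribute.
            url = attributes.get("content")
            if url:
                self.pdf_urls.append(url)


def find_pdf_urls(html):
    """Return every citation_pdf_url value found in an HTML string."""
    finder = PdfMetaFinder()
    finder.feed(html)
    return finder.pdf_urls


# Hypothetical page; example.org stands in for a real journal site.
page = """<html><head>
<meta name="DC.Title" content="An example paper">
<meta name="citation_pdf_url" content="http://example.org/reprint/1/2/3.pdf">
</head><body><a href="/reprint/1/2/3">Full Text (PDF)</a></body></html>"""
print(find_pdf_urls(page))
```

The point is that the tool never has to guess which of the page's ordinary links leads to the PDF.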
Conclusions
While most of the larger publishers provided an acceptable method of authentication, the PDF files they produce are obviously not optimised for ease of use by the reader.

It's almost impossible to build a tool to automatically fetch PDFs for papers (to attach to a bibliographic library in EndNote or BibDesk, say), because in most cases there are no machine-readable links to the PDF files. <link rel="alternate" type="application/pdf" href="http://path/to/the/pdf"/> would be ideal for this use.

Once the PDFs are downloaded, having a folder full of files named "science", "science(2)", "science(3)", etc, is no use at all (especially as they have no file extensions). Most publishers use page numbers as filenames, but even that's not very helpful: something like "first author-year-journal name-volume-page number.pdf" would be much better.

Having got the PDFs into some kind of order, there's then no useful metadata attached to any of them (apart from those published by Wiley, commendably, but even those still only have the title). There's space in the Info Dictionary for Title, Author, Subject and Keywords, and that's without even beginning to use XMP.

Finally there's the placement of the DOI, which should really be in the metadata but needs at least to be in the PDF text; Nature's bizarre copy protection; and the bookmark sections (Introduction, Methods, Results, Discussion, etc) which are rarely present but would also be useful, especially for searching in particular sections.

The implementation of all of these features could be automated with little change to the publishers' systems, but would be a major benefit to researchers struggling to deal with large amounts of literature.
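
As an example of how little work the filename suggestion would take, here is a minimal Python sketch of the "first author-year-journal name-volume-page number.pdf" scheme. The function name, its parameters and the sample citation are all hypothetical, and the set of characters it replaces is illustrative rather than exhaustive.

```python
import re


def paper_filename(first_author, year, journal, volume, first_page):
    """Build 'first author-year-journal name-volume-page number.pdf'."""
    parts = [first_author, str(year), journal, str(volume), str(first_page)]
    # Replace characters that commonly cause trouble in filenames
    # (an illustrative set, not a complete one).
    cleaned = [re.sub(r'[\\/:*?"<>|]+', "_", part.strip()) for part in parts]
    return "-".join(cleaned) + ".pdf"


# Made-up citation details, purely for illustration.
print(paper_filename("Smith", 2006, "J Immunol", 176, 1234))
```

A publisher already holds every one of these fields for each paper, so a name like this could be generated at the point of download.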