Joining the dots - advances in online biomedical literature management.

1. Searching abstracts and fulltext

Since its launch in 1997, the NCBI's PubMed database has grown to become the preeminent provider of abstract data to biomedical researchers worldwide [1]. Based on the MEDLINE and pre-MEDLINE catalogues of literature published in over 4,500 journals, PubMed provides extensive links to databases containing DNA sequences, taxonomical hierarchies, 3D protein structures and much more. By allowing journal publishers to submit metadata corresponding to their published articles, PubMed has also become a central index for locating the full text of articles online. As a search engine it works well, but the ever-increasing volume of medical literature published each year has shown the need for additional methods of knowledge management.

2. Regular updates

Researchers generally need to keep up-to-date with the latest developments in their field, to prevent repetition of others' work and to feed off the latest innovations and discoveries. When investigating a new subject area, the many variant search engines and interfaces to MEDLINE data [2,3,4,5,6,7] are valuable. However, for many researchers the burden of keeping up with newly published literature is becoming unmanageable. Information providers are therefore establishing a number of ways to automate this process.

E-mail (BioMail)

By registering their e-mail address with a BioMail server [8], a researcher can set up any number of searches that run automatically at a set frequency (daily, weekly, etc), with any new results delivered by e-mail. This process should become even more advanced in the future, as the BioMail project has received funding to investigate further features such as personalised literature treasuries [9]. This sort of project could also begin to include innovations such as automated suggestion of papers that are deemed to be relevant (analogous to TiVo, which uses AI algorithms to suggest TV programmes that may be of interest based on viewing habits). Using BioMail saves having to check for new results by hand, but adds further messages to an already bulging mailbox, each demanding the researcher's attention.

RSS

Originally intended for desktop delivery of news headlines, RSS [10] has grown to be many people's primary means of receiving up-to-date information from the internet. An RSS feed provides a summary of the articles recently posted to a website, so categorised information can be collected automatically and then read, searched and followed up at any time.

- HubMed search

The relevance of RSS to biomedical research is obvious. Automated daily (or even hourly) delivery of any articles added to the PubMed database that match your search terms is an effortless way to keep up with the literature. Setting up these alerts is also simple - just use the HubMed interface [5] to search PubMed, and drag the orange XML icon into your favourite newsreader.
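As a rough illustration of what a newsreader does with such a feed, here is a minimal Python sketch that parses RSS items and picks out those not seen before. The sample feed, its URLs and the function names are invented for illustration; a real HubMed feed may carry different or additional fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the XML a search feed might return.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example search feed</title>
    <item>
      <title>First example article</title>
      <link>http://example.org/article/1</link>
      <description>Abstract text of the first article.</description>
    </item>
    <item>
      <title>Second example article</title>
      <link>http://example.org/article/2</link>
      <description>Abstract text of the second article.</description>
    </item>
  </channel>
</rss>"""


def parse_feed_items(feed_xml):
    """Return a list of dicts (title, link, description), one per RSS item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def new_items(items, seen_links):
    """Keep only the items whose link has not been seen on an earlier check."""
    return [item for item in items if item["link"] not in seen_links]
```

A newsreader would run something like this on a timer, remembering the links it has already shown so that only newly added articles are flagged.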

- Newsreaders, daily updates

There are many news aggregation applications available for many platforms, but I shall mention here the two most fully featured and easy to use, though unfortunately not free. NetNewsWire Pro [11] (for the Macintosh) and NewzCrawler [12] (for the PC) both have similar features, including setting the frequency for retrieving updates. The RSS summaries that are available from HubMed all include abstracts and citation details, as well as links to further resources such as the online full-text of the articles. A major feature of these two newsreaders is that they include support for the Blogger and MetaWebLog APIs, which allows the reader to post to an online, publicly available information log.
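To give an idea of what posting through the MetaWebLog API involves, the following Python sketch builds the XML-RPC request for a metaWeblog.newPost call without sending it. The helper name and the example values are assumptions made for illustration; the endpoint address, blog ID and credentials depend entirely on the weblog being posted to.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, body, publish=True):
    """Return the XML-RPC request body for a metaWeblog.newPost call.

    The post itself is a struct; "title" and "description" are the
    members used here for the headline and the text of the entry.
    """
    post = {"title": title, "description": body}
    return xmlrpc.client.dumps(
        (blog_id, username, password, post, publish),
        methodname="metaWeblog.newPost",
    )


# Example values are placeholders, not a real account.
request_xml = build_new_post_request(
    "1", "reader", "secret",
    "An outstanding new paper",
    "Abstract copied from the feed, followed by my own comments.",
)
```

To actually publish, the same call could be made with xmlrpc.client.ServerProxy pointed at the weblog's XML-RPC address, which is what the newsreader does behind the scenes.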

3. Collaborative knowledge logs (journal club)

Blogging from newsreaders

Weblogs have been around for many years, and are continuing to gain popularity as a means of aggregating and distributing topical information to communities of like-minded people. The opportunities for collaboration are opened up immensely by these focussed news hubs, and the influence carried by the highest-ranking weblogs is considerable [14]. Collaborative weblogs are also strong community voices, and the most powerful [15,16] display a kind of smart mob behaviour [17], bringing the immediate attention of thousands of people to any points raised within.

On a different level, the collaborative filtering efforts of commercial sites such as Faculty of 1000 [18] bring the recommendations of experts into one place. However, as can be seen with hiplogging [19], the voices of unselected individuals can be just as interesting, if not as discerning.

In the case of biomedical research, accuracy is too important to allow noise to enter the system, so the participants must be limited. Because newsreader software can post to weblogs, anyone with a username and password can post their choice of information to a topical, publicly readable weblog, analogous to a real-world journal club.

In this case, as illustrated by ImmunoLog [13], anyone who finds a newly published paper particularly outstanding can post the abstract directly from the RSS feed, via the newsreader, along with their own description and discussion of the work. The ability of Movable Type [20] to allow anyone to post comments then opens up the discussion to other readers, who can also add recommendations for links to related papers and other online information.

4. Literature archiving

PubMed Central - free access, but limited by publishers' restraint

So the alerting service is sorted out and we have open collaborative filtering of topical research; all we need to do now is make sure that the refereed papers are available for everyone to read. Unfortunately, but unsurprisingly based on historical precedent [21], this has met with some resistance from those who make their profits from restricting free access to literature.

Many discussions [22,23,24,25] have focussed on the cost to universities of paying for access rights to papers whose copyright is held exclusively by the companies that publish them. The resulting efforts to self-archive at the organisational level, by university or department, are valuable, and the agreements by many publishers to make their archives freely available after a set period of time (usually 6 months to 1 year after publication) have also greatly opened up the availability of the literature to libraries with over-stretched budgets. However, even the efforts of PubMed Central to create a central, full-text searchable database of the freely available literature have been partially thwarted by many publishers' insistence that articles be available only from their own site [26].

P2P, Gnutella, distributed or centralised storage

A combination of local archiving, storage by publishers, and central searchable archives seems inevitable. The means of distribution of articles, however, is undecided, and may be the Web (as at present, perhaps using OAI [27]), or a distributed peer-to-peer mechanism such as Gnutella (which allows individuals to add further metadata to the articles they distribute, whether it be true or false [28,29]). While the WWW remains the prevalent form of literature distribution, and in the absence of a centralised archive, the benefits of open linking technology become evident.

5. Literature retrieval

Personalised bookmarklet allows lookup in local library; SFX server provides links to external resources and local catalogues

By maintaining a central database of the online locations of individual articles, accessible using the OpenURL format [30], linking services enable researchers to rapidly find the paper they need. Individual libraries running their own SFX servers [31] are able to adapt the links provided to include searching their own catalogues, linking to external sources to which the library has subscribed, and many other possibilities. The hurdle to be overcome in this situation is getting from the literature search engine or abstract index to one's appropriate local link server or library catalogue. This problem has been recently addressed by the dissemination of a Library Lookup bookmarklet and many variants [32], which can be personalised to enable one-click linking from any literature source containing the correct metadata (eg in the URL [33] or the markup [34]) to your choice of target. For example, a book on Amazon.com can be located in your local library, as can a journal article found using HubMed. If your library runs an SFX server, then the same article can be linked to the online fulltext with ease. A similar bookmarklet can be used to automatically fill out interlibrary loan forms, or order document delivery from other sources [35].
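The idea behind an OpenURL is simply that the citation travels in the query string of a link to the resolver. A minimal Python sketch, assuming a hypothetical resolver address and the commonly used citation keys (genre, issn, volume, spage, date), might look like this; the keys a particular link server expects should be checked against its own documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_openurl(resolver_base, citation):
    """Append the non-empty citation fields to the resolver address as a query string."""
    query = urlencode({key: value for key, value in citation.items() if value})
    return resolver_base + "?" + query


# Hypothetical resolver address and citation values, for illustration only.
example_url = build_openurl(
    "http://resolver.example.edu/menu",
    {
        "genre": "article",
        "issn": "1234-5678",
        "volume": "12",
        "issue": "",       # unknown fields are simply left out
        "spage": "34",
        "date": "2003",
    },
)
```

A bookmarklet does much the same thing in the browser: it picks the identifying metadata out of the current page and redirects to the local link server with that metadata attached.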

6. Local PDF stores

Once the researcher has succeeded in downloading a PDF file containing an article they were looking for, then the fun really starts. The dream of the paperless office began decades ago, and is now beginning to take root in reality. However, the hurdles are still present - but foreseeably avoidable.

PDF, XML, metadata

PDF is the method of choice for distributing biomedical literature. It preserves formatting and allows the content to be displayed and printed identically on all major platforms. However, it is riddled with pitfalls for those trying to catalogue and work with their stored files. For a start, the text isn't stored as text, but as positioned individual letters. Adobe has done well in producing software that can guess the joining of words, lines and columns to a reasonable degree of accuracy, but compared to the searchability of XML (used by PubMed Central to allow full-text searching of its archives, as well as searching in particular sections of papers) this is severely limited. One of the main features of XML is semantic metadata: each section is tagged with descriptive, machine-readable text that greatly increases the ease of cataloguing.

Here are a few simple solutions:

Publishers tag PDF files with Title, Author, Subject and Keywords metadata;
Publishers tag PDF files with their unique PubMed ID number;
Publishers just have to name the file with the unique PubMed ID number.

In the last two cases, it's a simple matter to run a Perl script that uses the NCBI's E-Utilities Web Service [36] to add the metadata to the files yourself. Software such as PDF Explorer [37] can then be used to catalogue and search this information to find the file you need (and as an added bonus the files will have meaningful titles).
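The paragraph above mentions Perl; the same idea is sketched here in Python. The sketch assumes files named with their PubMed ID (for example 12345678.pdf), looks the ID up through the E-Utilities ESummary service, and pulls a few fields out of the XML that comes back. The function names are my own, and the parsing assumes the DocSum layout of Id and named Item elements, which should be checked against the current E-Utilities documentation. Writing the metadata into the PDF itself is left out, as that needs a separate PDF library.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlencode

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def pmid_from_filename(path):
    """Return the PubMed ID if the file name (minus extension) is all digits, else None."""
    stem = Path(path).stem
    return stem if re.fullmatch(r"\d+", stem) else None


def esummary_url(pmid):
    """Build the ESummary request address for one PubMed ID."""
    return ESUMMARY_URL + "?" + urlencode({"db": "pubmed", "id": pmid})


def parse_docsum(xml_text):
    """Extract PMID, title, source and date from an ESummary response.

    Assumes a DocSum element containing an Id and Item elements with a
    Name attribute; returns an empty dict if no DocSum is present.
    """
    root = ET.fromstring(xml_text)
    doc = root.find("DocSum")
    if doc is None:
        return {}
    fields = {"pmid": doc.findtext("Id", default="")}
    for item in doc.findall("Item"):
        name = item.get("Name")
        if name in ("Title", "Source", "PubDate"):
            fields[name.lower()] = item.text or ""
    return fields
```

In use, the response would be fetched with something like urllib.request.urlopen(esummary_url(pmid)).read() for each file in the PDF folder, and the resulting fields passed on to whatever tool writes the metadata or renames the file.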

Endnote etc - export links from literature searches, link to local PDF files

The alternative is reference management software such as Endnote. Citations exported from HubMed, or retrieved directly from PubMed, often contain a link to the fulltext of the article online. The newest version of Endnote also allows references to be linked to a local PDF file. By linking the citation data to local files, the user is able to search through appropriate metadata without having to alter the PDF file. If Endnote were also able to store, link and search PubMed Central's XML fulltext, that would be a further useful step forward.

7. Prospects

We're moving towards the integration of search engines, email clients, newsreaders, knowledge management and publishing tools into single, plugin-based applications (eg Chandler [38]). With the incorporation of the facilities currently available in instant messaging clients (see Trillian [39] for the PC and Fire [40] for the Mac), and with shared Wiki-like workspaces such as Groove [41] as a base, the ability to collaborate with fellow researchers will be greatly increased.

Suitably, the future of Office applications seems to revolve around standardised, XML-based storage formats. Combined with new archival methods, we should hope to see a substantial increase in the availability and ease of discovery of biomedical literature online.

8. References

[1] http://www.earlham.edu/~peters/fos/timeline.htm
[2] http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
[3] http://research.bmn.com/medline
[4] http://www.scirus.com/
[5] http://www.pmbrowser.info/
[6] http://pubmed.antarcti.ca/
[7] http://www.pmbrowser.info/pubmed.htm
[8] http://www.biomail.org/
[9] http://bioinformatics.org/forums/forum.php?forum_id=1269
[10] http://backend.userland.com/rss
[11] http://ranchero.com/software/netnewswire/profeatures.php
[12] http://www.newzcrawler.com/
[13] http://www.pmbrowser.info/immunolog/
[14] eg http://www.boingboing.net/
[15] http://www.slashdot.org/
[16] http://www.metafilter.com/
[17] http://www.smartmobs.com/
[18] http://www.facultyof1000.com/
[19] http://www.hiptop.com/
[20] http://www.movabletype.org/
[21] http://www.eff.org/
[22] http://www.sciencemag.org/feature/data/hottopics/plsdebate.shtml
[23] http://www.nature.com/nature/debates/e-access/
[24] http://www.topica.com/lists/fos-forum/read
[25] http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/subject.html
[26] http://www.pubmedcentral.nih.gov/about/newoption.html
[27] http://www.openarchives.org/
[28] http://www.well.com/~doctorow/metacrap.htm
[29] http://bitzi.com/
[30] http://demo.exlibrisgroup.com:9003/demo - The OpenURL format allows the exchange of information between information providers using the simplest and most accessible means available: the URL.
[31] http://www.sfxit.com/
[32] http://weblog.infoworld.com/udell/stories/2002/12/11/librarylookup.html
[33] eg http://www.amazon.com/exec/obidos/tg/detail/-/0151008116/
[34] http://diveintomark.org/archives/2002/12/29.html
[35] http://weblog.infoworld.com/udell/2002/12/30.html
[36] http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
[37] http://planeta.clix.pt/rtt/
[38] http://www.osafoundation.org/feature_summary.htm
[39] http://www.ceruleanstudios.com/trillian/index.html
[40] http://fire.sourceforge.net/
[41] http://www.groove.net/products/workspace/

This work is licensed under a Creative Commons License.