Relevance-ranked search results in HubMed

I'm pleased to say that I've finally got a Lucene index of the MEDLINE database up and running, so HubMed can now return search results ordered by relevance. This option can be set using the menu on the front page or by appending &sort=relevance or &sort=date (the default) to the end of a search URL.
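As a rough illustration of the URL option, here is a minimal Python sketch that sets the sort parameter on an existing search URL. Only the `sort` parameter and its two values come from the description above; the `example.org` host, the `/search` path and the `q` parameter are placeholders, and `with_sort` is a hypothetical helper, not part of HubMed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The two sort values described in the post; "date" is the default.
SORT_VALUES = ("relevance", "date")


def with_sort(search_url, sort="relevance"):
    """Return search_url with its sort parameter set to `sort`.

    Hypothetical helper: any existing sort parameter is replaced,
    and all other query parameters are kept in their original order.
    """
    if sort not in SORT_VALUES:
        raise ValueError("sort must be one of %r" % (SORT_VALUES,))
    parts = urlsplit(search_url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sort"
    ]
    params.append(("sort", sort))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL, not a real HubMed address.
print(with_sort("http://example.org/search?q=lucene"))
```

Replacing an existing value rather than appending a second one avoids ending up with two conflicting sort parameters in the same URL.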

Relevance-ranked searches use the Lucene query syntax, so queries written in the PubMed query syntax won't translate directly. In return, more powerful searches become possible, including phrase searching. Terms are combined with AND by default, the same as Google, and the list of stopwords is the same as that used by PubMed.
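To make the phrase searching concrete, here is a small Python sketch that assembles a query string in Lucene syntax. It relies on two things: Lucene treats text in double quotes as a phrase, and terms separated by spaces are combined with AND when that is the default operator. The helper names `phrase` and `build_query` are made up for this example.

```python
def phrase(text):
    """Wrap text in double quotes so Lucene treats it as a phrase.

    Backslashes and double quotes inside the phrase are escaped
    with a backslash, as the Lucene query syntax requires.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def build_query(phrases=(), terms=()):
    """Join phrases and single terms with spaces.

    With AND as the default operator, every phrase and term
    must match for a record to be returned.
    """
    parts = [phrase(p) for p in phrases] + list(terms)
    return " ".join(parts)


print(build_query(phrases=["gene expression"], terms=["yeast"]))
```

This sketch does not escape Lucene's special characters in the single terms, so it is only suitable for plain words.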

Relevance-ranked MEDLINE searching isn't new: BioMedNet's (discontinued) MEDLINE search produced relevance-ranked results years ago; the EBI's new CitExplore service has a Lucene index of both MEDLINE and CiteSeer (though their word stemming is problematic); and both Elsevier's Scopus and Google Scholar can order search results by relevance. However, the ability to search in this way is something that HubMed has needed for a long time.

Limitations: At the moment, only the title and abstract fields are indexed. There is no query expansion (matching to MeSH terms, etc.), so a relevance-ranked search may return fewer results than the same search using the standard method. Also, no stemming is currently performed on queries, so plurals won't be matched automatically.
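Until stemming is available, one possible workaround is to spell out the plural yourself using Lucene's OR operator and parentheses. The Python sketch below does this in a deliberately naive way: it assumes a query of plain whitespace-separated words (no phrases, operators or field names) and only adds an "s", so it will miss irregular plurals. `expand_plurals` is a hypothetical helper, not something HubMed provides.

```python
def expand_plurals(query):
    """Rewrite each plain term as (term OR terms).

    Naive on purpose: words already ending in "s" are left alone,
    and only a trailing "s" is ever added. The groups are separated
    by spaces, so they are combined with the default AND operator.
    """
    groups = []
    for term in query.split():
        if term.lower().endswith("s"):
            groups.append(term)
        else:
            groups.append("(%s OR %ss)" % (term, term))
    return " ".join(groups)


print(expand_plurals("gene mutation"))
```

The cost is a longer query, but each group still has to match, so the AND behaviour of the original search is preserved.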

Thanks to PyLucene developer Andi Vajda, as well as Ed Summers and others on #code4lib, for advice on the indexing.