JANICE: a prototype re-implementation of JANE, using the Semantic Scholar Open Research Corpus



For many years, JANE has provided a free service to users who are looking to find experts on a topic (usually to invite them as peer reviewers) or to identify a suitable journal for manuscript submission.

The source code for JANE has recently been published, and the recommendation process is described in a 2008 paper. Essentially, the algorithm takes some input text (title and/or abstract) and queries a Lucene index of PubMed metadata to find similar papers (with some filters for recency, article type and journal quality). It then calculates a score for each author or journal by summing the relevance scores over the 50 most similar articles.
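The scoring step described above can be sketched in a few lines of Python. This is only an illustration of the "sum the relevance scores over the top articles" idea, not JANE's actual code: the function name `score_entities`, the shape of each hit (a dict with `score`, `authors` and `journal`) and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def score_entities(hits, key, top_n=50):
    """Sum relevance scores per entity over the top_n most similar hits.

    hits: list of dicts, each with a numeric "score" (hypothetical shape).
    key: function returning the list of entities (authors or journals) for a hit.
    """
    # Keep only the top_n hits by relevance score.
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_n]
    totals = defaultdict(float)
    for hit in ranked:
        for entity in key(hit):
            totals[entity] += hit["score"]
    # Highest total first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up search results, for illustration only.
hits = [
    {"score": 3.0, "authors": ["Smith J", "Lee K"], "journal": "J Example"},
    {"score": 2.0, "authors": ["Smith J"], "journal": "Other J"},
    {"score": 1.0, "authors": ["Lee K"], "journal": "J Example"},
]

author_ranking = score_entities(hits, lambda h: h["authors"])
journal_ranking = score_entities(hits, lambda h: [h["journal"]])
```

The same function ranks either authors or journals, depending on which field the `key` function extracts from each hit.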

JANE produces a list of the most relevant authors of similar work, and does some extra parsing to extract their published email addresses. As PubMed doesn't disambiguate authors (apart from the relatively recent inclusion of ORCID identifiers), the name is used as the key for each author, so two different authors who share a name can occasionally be merged into a single entry in the search results.

Semantic Scholar

The latest release of Semantic Scholar's Open Research Corpus contains metadata for just over 20 million journal articles published since 1991, covering computer science and biomedicine. The metadata for each paper includes title, abstract, year of publication, authors, citations (papers that cited this paper) and references (papers that were cited by this paper). Importantly, authors and papers are each given a unique ID.
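To make the value of unique IDs concrete, here is a rough sketch of a paper record with the fields listed above. The class and attribute names are hypothetical and simplified for illustration; they are not the corpus's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Author:
    id: str    # unique author ID (hypothetical field name)
    name: str


@dataclass
class Paper:
    id: str
    title: str
    abstract: str
    year: int
    authors: List[Author] = field(default_factory=list)
    citations: List[str] = field(default_factory=list)   # IDs of papers that cited this paper
    references: List[str] = field(default_factory=list)  # IDs of papers cited by this paper


# Two different people who share a name (made-up data).
papers = [
    Paper("p1", "First paper", "", 2015, [Author("a1", "J Smith")]),
    Paper("p2", "Second paper", "", 2016, [Author("a2", "J Smith")]),
]

by_name = {a.name for p in papers for a in p.authors}
by_id = {a.id for p in papers for a in p.authors}
```

Keyed by name, the two authors collapse into one entry; keyed by ID, they stay separate.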


JANICE is a prototype re-implementation of the main features of JANE: taking input text and finding similar authors or journals. It runs a More Like This query with the input text against an Elasticsearch index of the Open Research Corpus data and retrieves the 100 most similar papers (optionally filtered by publication date). It then calculates a score for each author or journal by summing the relevance scores of the retrieved papers they appear on.
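As a rough sketch of that process (not JANICE's actual code), the snippet below builds an Elasticsearch search body around a `more_like_this` query and then sums the hit scores per author ID. The field names `title`, `abstract`, `year` and `authors`, the simplified author shape (`id`, `name`), the `min_term_freq` setting and the helper names are all assumptions. No Elasticsearch client is used here; a hand-written response in the standard `hits.hits` shape stands in for a real search result.

```python
def build_mlt_query(text, size=100, min_year=None):
    """Build a search body that finds the `size` papers most similar to `text`."""
    mlt = {
        "more_like_this": {
            "fields": ["title", "abstract"],  # assumed field names
            "like": text,
            "min_term_freq": 1,  # assumed setting: short inputs rarely repeat terms
        }
    }
    bool_query = {"must": [mlt]}
    if min_year is not None:
        # Optional publication-date filter; does not affect relevance scores.
        bool_query["filter"] = [{"range": {"year": {"gte": min_year}}}]
    return {"size": size, "query": {"bool": bool_query}}


def score_authors(response):
    """Sum each hit's relevance score into a total per author ID."""
    totals = {}
    for hit in response["hits"]["hits"]:
        for author in hit["_source"].get("authors", []):
            entry = totals.setdefault(
                author["id"], {"name": author["name"], "score": 0.0}
            )
            entry["score"] += hit["_score"]
    return sorted(totals.items(), key=lambda kv: kv[1]["score"], reverse=True)


body = build_mlt_query("gene expression in zebrafish", min_year=2010)

# Made-up response, standing in for what a search call would return.
fake_response = {"hits": {"hits": [
    {"_score": 2.5, "_source": {"authors": [{"id": "a1", "name": "J Smith"}]}},
    {"_score": 1.5, "_source": {"authors": [{"id": "a1", "name": "J Smith"},
                                            {"id": "a2", "name": "K Lee"}]}},
]}}
ranking = score_authors(fake_response)
```

Scoring journals would follow the same pattern, keyed on whichever journal field the index exposes.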

The results of this algorithm are promising: using open peer review data from one of PeerJ's published articles, JANICE returned a list of suggested reviewers containing 2 of the 3 actual reviewers within the top 10. The third reviewer was missing from the list only because, although they had authored a relevant paper, it happened not to use the same keywords as the input text (using word vectors would help here).


This prototype was built as part of the development of xpub, a journal platform produced by the Collaborative Knowledge Foundation and partner organisations.