Indexing Semantic Scholar's Open Research Corpus in Elasticsearch

·

Semantic Scholar publishes an Open Research Corpus dataset, which currently contains metadata for around 20 million research papers published since 1991.

  1. Create a DigitalOcean droplet using a "one-click apps" image for Docker on Ubuntu (3GB RAM, $15/month) and attach a 200GB data volume ($20/month).
  2. SSH into the instance and start an Elasticsearch cluster running in Docker.
  3. Install esbulk: VERSION=0.4.8; curl -L https://github.com/miku/esbulk/releases/download/v${VERSION}/esbulk_${VERSION}_amd64.deb -o esbulk.deb && dpkg -i esbulk.deb && rm esbulk.deb
  4. Fetch, unzip and import the Open Research Corpus dataset (inside the zip archive is a license.txt file and a gzipped, newline-delimited JSON file): VERSION=2017-10-30; curl -L https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/${VERSION}/papers-${VERSION}.zip -o papers.zip && unzip papers.zip && rm papers.zip && esbulk -index scholar -type paper -id id -verbose -purge -z < papers-${VERSION}.json.gz && rm papers-${VERSION}.json.gz
  5. While importing, index statistics can be viewed at http://localhost:9200/scholar/_stats?pretty
  6. After indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
  7. (recommended) Use ufw to prevent external access to the Elasticsearch service and put a web service (e.g. an Express app) in front of it, mapping routes to Elasticsearch queries.