Indexing Semantic Scholar's Open Research Corpus in Elasticsearch

Semantic Scholar publishes an Open Research Corpus dataset, which currently contains metadata for around 20 million research papers published since 1991. The steps below index the full corpus in Elasticsearch on a single, modestly sized DigitalOcean droplet:

  1. Create a DigitalOcean droplet using a "one-click apps" image for Docker on Ubuntu (3GB RAM, $15/month) and attach a 200GB data volume ($20/month).
  2. SSH into the instance and start an Elasticsearch cluster running in Docker (a minimal single-node sketch follows this list).
  3. Create a new index with a single shard, no replicas and best_compression: curl -XPUT 'http://localhost:9200/scholar' -H 'Content-Type: application/json' -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0, "codec": "best_compression" } } }'
  4. Install esbulk: VERSION=0.4.8; curl -L https://github.com/miku/esbulk/releases/download/v${VERSION}/esbulk_${VERSION}_amd64.deb -o esbulk.deb && dpkg -i esbulk.deb && rm esbulk.deb
  5. Fetch, unzip and import the Open Research Corpus dataset (inside the zip archive is a license.txt file and a gzipped, newline-delimited JSON file): VERSION=2017-10-30; curl -L https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/${VERSION}/papers-${VERSION}.zip -o papers.zip && unzip papers.zip && rm papers.zip && esbulk -index scholar -type paper -id id -verbose -z < papers-${VERSION}.json.gz && rm papers-${VERSION}.json.gz
  6. While the import is running, index statistics can be viewed at http://localhost:9200/scholar/_stats?pretty
  7. After indexing, optimise the Elasticsearch index by merging into a single segment: curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
  8. (recommended) Use ufw to prevent external access to the Elasticsearch service and put a web service (e.g. an Express app) in front of it, mapping routes to Elasticsearch queries (example firewall rules and a sample query follow this list).
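
Step 2 is intentionally brief; the sketch below shows one way to run a single-node cluster in Docker. The image tag, heap size and data path are assumptions rather than values from the steps above (a DigitalOcean volume typically mounts under /mnt/<volume-name>), so adjust them to your setup.

    # Elasticsearch needs a raised mmap count limit when run in Docker.
    sysctl -w vm.max_map_count=262144

    # Single-node Elasticsearch, published on localhost only, with the index
    # data kept on the attached volume. The mounted directory must be writable
    # by the container's elasticsearch user; version, heap and paths are illustrative.
    docker run -d --name elasticsearch \
      -p 127.0.0.1:9200:9200 \
      -e "discovery.type=single-node" \
      -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
      -v /mnt/scholar-data/esdata:/usr/share/elasticsearch/data \
      docker.elastic.co/elasticsearch/elasticsearch:6.0.0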
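
For step 8, a default-deny firewall that leaves only SSH and the public web port open might look like the following. Note that Docker publishes container ports through its own iptables rules, which can bypass ufw, so binding Elasticsearch to 127.0.0.1 (as in the docker run sketch above) is the more reliable way to keep it off the public interface.

    # Block everything inbound except SSH and the web service in front of Elasticsearch.
    ufw default deny incoming
    ufw allow 22/tcp
    ufw allow 80/tcp
    ufw enable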
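
Finally, a hypothetical front-end route such as GET /search?q=... would typically translate into a search request like the one below. This is only a sketch of the kind of query a web service would map its routes to, assuming a title field in the indexed records.

    # Full-text match on paper titles, returning the top 10 hits.
    curl -s 'http://localhost:9200/scholar/_search?pretty' \
      -H 'Content-Type: application/json' \
      -d '{ "query": { "match": { "title": "semantic scholar" } }, "size": 10 }'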