Semantic Scholar publishes an Open Research Corpus dataset, which currently contains metadata for around 20 million research papers published since 1991.
- Create a DigitalOcean droplet using a "one-click apps" image for Docker on Ubuntu (3GB RAM, $15/month) and attach a 200GB data volume ($20/month).
- SSH into the instance and start an Elasticsearch cluster running in Docker.
- Install esbulk:
VERSION=0.4.8; curl -L https://github.com/miku/esbulk/releases/download/v${VERSION}/esbulk_${VERSION}_amd64.deb -o esbulk.deb && dpkg -i esbulk.deb && rm esbulk.deb
- Fetch, unzip and import the Open Research Corpus dataset (inside the zip archive is a license.txt file and a gzipped, newline-delimited JSON file):
VERSION=2017-10-30; curl -L https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/${VERSION}/papers-${VERSION}.zip -o papers.zip \
  && unzip papers.zip && rm papers.zip \
  && esbulk -index scholar -type paper -id id -verbose -purge -z < papers-${VERSION}.json.gz \
  && rm papers-${VERSION}.json.gz
- While importing, index statistics can be viewed at:
http://localhost:9200/scholar/_stats?pretty
- After indexing, optimise the Elasticsearch index by merging into a single segment:
curl -XPOST 'http://localhost:9200/scholar/_forcemerge?max_num_segments=1'
- (recommended) Use ufw to prevent external access to the Elasticsearch service and put a web service (e.g. an Express app) in front of it, mapping routes to Elasticsearch queries.
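
The firewall step could look roughly like the sketch below. It rests on assumptions that are not in the original: SSH listens on port 22, the front-end web service listens on port 80, and the `run` and `lock_down` helpers and the `DRY_RUN` switch are hypothetical names used only for illustration. By default the script only prints the ufw commands; run it as root with `DRY_RUN=0` to apply them.

```shell
#!/bin/sh
# Sketch: restrict inbound traffic with ufw so that Elasticsearch (port 9200)
# is not reachable from outside the droplet.
# Assumptions (not from the original): SSH on 22/tcp, web service on 80/tcp.

# Print each command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

lock_down() {
  run ufw default deny incoming   # block all inbound traffic unless allowed below
  run ufw default allow outgoing  # leave outbound traffic unrestricted
  run ufw allow 22/tcp            # keep SSH reachable
  run ufw allow 80/tcp            # the public web service (e.g. the Express app)
  run ufw --force enable          # enable without the interactive prompt
}

lock_down
```

Port 9200 is deliberately never opened. Note that ports published by Docker can bypass ufw rules, so it may also be necessary to publish the Elasticsearch port on the loopback interface only (for example `-p 127.0.0.1:9200:9200` when starting the container).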