A Solr index of Wikipedia on EC2/EBS

Revisiting a couple of old posts... notes on getting a Solr index of Wikipedia on EC2, now that booting from EBS volumes is available.

  1. Launch an Ubuntu Karmic AMI [more AMIs] in the EC2 console.
  2. Convert it to an EBS AMI.
  3. Launch the EBS AMI.
  4. Create a 20GB volume (for data) in the same zone; attach it to the instance at /dev/sdf.
  5. Format the volume (install the xfsprogs package first if mkfs.xfs is missing): sudo mkfs.xfs /dev/sdf
  6. Mount the volume at /solr:
    sudo mkdir /solr
    sudo mount /dev/sdf /solr
  7. Download and unpack Solr 1.4.0, then move dist/apache-solr-1.4.0.war from the extracted directory to /solr/solr.war:
    wget -O - http://apache.mirror.facebook.net/lucene/solr/1.4.0/apache-solr-1.4.0.tgz | tar -xvz
  8. Create the config folder with sudo mkdir -p /solr/wikipedia/conf and put schema.xml and solrconfig.xml in it.
  9. sudo apt-get install tomcat6
  10. Put wikipedia.xml (a Tomcat context file) in /etc/tomcat6/Catalina/localhost, with docBase set to /solr/solr.war and the solr/home environment entry set to /solr/wikipedia.
  11. Make sure Tomcat can read and write /solr/wikipedia. Restart Tomcat. It should create /solr/wikipedia/data.
  12. Create a 100GB volume from Freebase's Wikipedia snapshot (currently snap-1781757e: "Wikipedia Extraction-WEX (Linux)", though it's a year-old snapshot) in the same zone; attach it to the instance at /dev/sdg.
  13. Mount the volume at /wex:
    sudo mkdir /wex
    sudo mount /dev/sdg /wex
  14. Import the WEX TSV file (about 4 million documents) into Solr:
    curl 'http://localhost:8080/wikipedia/update/csv?commit=false&overwrite=false&separator=%09&encapsulator=%1f&header=false&fieldnames=id,title,updated,xml,text&skip=updated,xml&stream.file=/wex/rawd/freebase-wex-2009-01-12-articles.tsv&stream.contentType=text/plain;charset=utf-8'
  15. Optimize the Lucene index: curl 'http://localhost:8080/wikipedia/update?optimize=true'
  16. Query Solr using the MoreLikeThis handler:
    http://YOUR-PUBLIC-DNS.amazonaws.com:8080/wikipedia/mlt?rows=5&wt=json&stream.body=the+text+to+compare+to+the+index
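
Step 10 doesn't show the context file itself. Here's a minimal sketch of what it might contain, written out from the shell. write_solr_context is a made-up helper name, and the Context/Environment layout follows the usual Solr-on-Tomcat setup rather than the exact file used for these notes:

```shell
#!/bin/sh
# Sketch of step 10: write a Tomcat context file that points Solr at /solr.
# write_solr_context is a hypothetical helper, not part of Solr or Tomcat.
write_solr_context() {
  context_file=$1   # e.g. /etc/tomcat6/Catalina/localhost/wikipedia.xml
  solr_war=$2       # e.g. /solr/solr.war
  solr_home=$3      # e.g. /solr/wikipedia
  # docBase tells Tomcat which WAR to deploy; the solr/home environment
  # entry tells Solr where to find conf/ and where to create data/.
  cat > "$context_file" <<EOF
<Context docBase="$solr_war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="$solr_home" override="true"/>
</Context>
EOF
}
```

Run it as root with the paths from the steps above, e.g. write_solr_context /etc/tomcat6/Catalina/localhost/wikipedia.xml /solr/solr.war /solr/wikipedia. The file name decides the URL path, which is why everything is served under /wikipedia.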

Optional: create a new dyndns.com zone, then request http://${USER}:${PASS}@members.dyndns.org/nic/update?hostname=${HOST} on startup to set its IP address.
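
A rough sketch of that startup call using curl. The function name and the CURL override are made up for illustration; credentials are passed with -u rather than embedded in the URL, which should send the same basic-auth header:

```shell
#!/bin/sh
# Sketch: point a DynDNS hostname at this instance.
# dyndns_update and the CURL override are hypothetical; set CURL=echo
# to see the command without touching the network.
dyndns_update() {
  dyn_user=$1   # DynDNS account name
  dyn_pass=$2   # DynDNS password
  dyn_host=$3   # hostname to update
  # No IP is sent, so the service is left to use the address the request
  # comes from, as with the plain URL in the note above.
  ${CURL:-curl} -sS -u "${dyn_user}:${dyn_pass}" \
    "http://members.dyndns.org/nic/update?hostname=${dyn_host}"
}
```

Calling this from something like /etc/rc.local is one way to run it on boot. Note that a shell already sets ${USER} to the login name, hence the different variable names here.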

Optional: create a new snapshot from the EBS data volume and share it so that anyone can use the Lucene index and Solr configuration with their own instance of Solr.
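
A sketch of the snapshot-and-share step, assuming the present-day AWS CLI (aws) is installed and configured, which these notes don't otherwise use. share_volume_snapshot and the AWS override are hypothetical names, and the description string is just an example:

```shell
#!/bin/sh
# Sketch: snapshot the data volume and make the snapshot public.
# share_volume_snapshot and the AWS override are hypothetical helpers.
share_volume_snapshot() {
  volume_id=$1   # the 20GB data volume mounted at /solr
  # Start the snapshot and capture just its ID.
  snapshot_id=$(${AWS:-aws} ec2 create-snapshot \
    --volume-id "$volume_id" \
    --description "Solr index of Wikipedia (WEX)" \
    --query SnapshotId --output text) || return 1
  # Wait until the snapshot has completed before changing permissions.
  ${AWS:-aws} ec2 wait snapshot-completed \
    --snapshot-ids "$snapshot_id" || return 1
  # Let any AWS account create volumes from the snapshot.
  ${AWS:-aws} ec2 modify-snapshot-attribute \
    --snapshot-id "$snapshot_id" \
    --attribute createVolumePermission \
    --operation-type add --group-names all || return 1
  echo "$snapshot_id"
}
```

It's probably safest to stop Tomcat (or at least stop writing to the index) before taking the snapshot, so the Lucene files on the volume are in a consistent state.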