Revisiting a couple of old posts... notes on getting a Solr index of Wikipedia on EC2, now that booting from EBS volumes is available.
- Launch an Ubuntu Karmic AMI [more AMIs] in the EC2 console.
- Convert it to an EBS AMI.
- Launch the EBS AMI.
- Create a 20GB volume (for data) in the same zone; attach it to the instance at /dev/sdf.
- Format the volume: sudo mkfs.xfs /dev/sdf (install the xfsprogs package first if mkfs.xfs is missing)
- Mount the volume at /solr:
sudo mkdir /solr
sudo mount /dev/sdf /solr
- Download and unpack Solr 1.4.0: wget -O - http://apache.mirror.facebook.net/lucene/solr/1.4.0/apache-solr-1.4.0.tgz | tar -xvz, then move dist/apache-solr-1.4.0.war (inside the extracted apache-solr-1.4.0 directory) to /solr/solr.war
- mkdir -p /solr/wikipedia/conf and put schema.xml and solrconfig.xml in that folder.
- sudo apt-get install tomcat6
- Put wikipedia.xml in /etc/tomcat6/Catalina/localhost with solr.home set to /solr/wikipedia and docBase set to /solr/solr.war
- Make sure Tomcat can read and write /solr/wikipedia. Restart Tomcat. It should create /solr/wikipedia/data.
- Create a 100GB volume from Freebase's Wikipedia snapshot (currently snap-1781757e: "Wikipedia Extraction-WEX (Linux)", though it's a year-old snapshot) in the same zone; attach it to the instance at /dev/sdg.
- Mount the volume at /wex: sudo mkdir /wex && sudo mount /dev/sdg /wex
- Import the WEX TSV file (about 4 million documents) into Solr:
curl 'http://localhost:8080/wikipedia/update/csv?commit=false&overwrite=false&separator=%09&encapsulator=%1f&header=false&fieldnames=id,title,updated,xml,text&skip=updated,xml&stream.file=/wex/rawd/freebase-wex-2009-01-12-articles.tsv&stream.contentType=text/plain;charset=utf-8'
- Optimize the Lucene index: curl 'http://localhost:8080/wikipedia/update?optimize=true'
- Query Solr using the MoreLikeThis handler:
http://YOUR-PUBLIC-DNS.amazonaws.com:8080/wikipedia/mlt?rows=5&wt=json&stream.body=the+text+to+compare+to+the+index
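The steps above mention wikipedia.xml without showing its contents. Here is a minimal sketch of what that Tomcat context file might look like, written out by a small shell function. The paths come from the steps above; the solr/home environment entry is the name Solr's Tomcat setup instructions use for the Solr home directory, and the function name write_solr_context is hypothetical, so treat this as one plausible version rather than the exact file used here.

```shell
#!/bin/sh
# Sketch: write a Tomcat context file for the Solr webapp.
# write_solr_context is a hypothetical helper; it takes the directory
# to write into, so it can be tried in a scratch directory first.
write_solr_context() {
    context_dir="$1"
    # docBase points at the war file; solr/home points at the Solr home
    # directory that contains conf/ (and, after the first start, data/).
    cat > "$context_dir/wikipedia.xml" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/solr/solr.war">
  <Environment name="solr/home" type="java.lang.String"
               value="/solr/wikipedia" override="true"/>
</Context>
EOF
}

# Usage on the instance (needs root to write under /etc):
#   write_solr_context /etc/tomcat6/Catalina/localhost
```

Because the file is named wikipedia.xml, Tomcat serves the webapp under /wikipedia, which matches the URLs used in the import and query steps.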
Optional: create a new dyndns.com zone, then request http://${USER}:${PASS}@members.dyndns.org/nic/update?hostname=${HOST} on startup to set the IP address.
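A rough sketch of how that startup update could be scripted. The function name dyndns_update_url and the credentials in the usage comment are placeholders, and the sketch assumes the user name and password contain no characters that would need URL-encoding.

```shell
#!/bin/sh
# Sketch: build the DynDNS update URL from a user name, password and host name.
# dyndns_update_url is a hypothetical helper; it only prints the URL and
# does not contact the service itself.
dyndns_update_url() {
    printf 'http://%s:%s@members.dyndns.org/nic/update?hostname=%s\n' "$1" "$2" "$3"
}

# Usage, for example from /etc/rc.local (placeholder values):
#   curl -s "$(dyndns_update_url myuser mypass myhost.dyndns.org)"
```

Keeping the URL construction separate from the curl call makes it easy to check the result by eye before putting it in a startup script.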
Optional: create a new snapshot from the EBS data volume and share it so that anyone can use the Lucene index and Solr configuration with their own instance of Solr.