A Solr index of Wikipedia on EC2/EBS

Revisiting a couple of old posts... notes on getting a Solr index of Wikipedia on EC2, now that booting from EBS volumes is available.

Launch an Ubuntu Karmic AMI [more AMIs] in the EC2 console.
Convert it to an EBS AMI.
Launch the EBS AMI.
Create a 20GB volume (for data) in the same zone; attach it to the instance at /dev/sdf.
Format the volume: sudo mkfs.xfs /dev/sdf (install the xfsprogs package first if mkfs.xfs is missing).

Mount the volume at /solr:

sudo mkdir /solr
sudo mount /dev/sdf /solr

Download and unpack Solr: wget -O - http://apache.mirror.facebook.net/lucene/solr/1.4.0/apache-solr-1.4.0.tgz | tar -xvz, then move dist/apache-solr-1.4.0.war from the unpacked directory to /solr/solr.war
Create the Solr home: mkdir -p /solr/wikipedia/conf, then put schema.xml and solrconfig.xml in that directory.
sudo apt-get install tomcat6
Put wikipedia.xml (a Tomcat context file) in /etc/tomcat6/Catalina/localhost, with solr.home set to /solr/wikipedia and docBase set to /solr/solr.war
Make sure Tomcat can read and write /solr/wikipedia. Restart Tomcat. It should create /solr/wikipedia/data.
Create a 100GB volume from Freebase's Wikipedia snapshot (currently snap-1781757e: "Wikipedia Extraction-WEX (Linux)", though the snapshot is a year old) in the same zone; attach it to the instance at /dev/sdg.
Mount the volume at /wex: sudo mkdir /wex && sudo mount /dev/sdg /wex
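
For the Tomcat context file step above, here is a minimal sketch of what wikipedia.xml might contain. It assumes the JNDI environment entry solr/home is how Solr 1.4 picks up its home directory under Tomcat; the write_context helper name is made up for illustration, and the tomcat6 user and init script named in the comments are assumptions about the Ubuntu package.

```shell
#!/bin/sh
# write_context DIR: write a Tomcat context file for Solr into DIR.
# (Hypothetical helper; DIR defaults to the Tomcat directory used above.)
write_context() {
  dir="${1:-/etc/tomcat6/Catalina/localhost}"
  cat > "$dir/wikipedia.xml" <<'EOF'
<Context docBase="/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/solr/wikipedia" override="true"/>
</Context>
EOF
}

# Typical use, run as root (assumes Tomcat runs as the tomcat6 user):
#   write_context
#   chown -R tomcat6:tomcat6 /solr/wikipedia
#   /etc/init.d/tomcat6 restart
```

The file name determines the context path, which is why the URLs below start with /wikipedia.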

Import the WEX TSV file (about 4 million documents) into Solr:

curl 'http://localhost:8080/wikipedia/update/csv?commit=false&overwrite=false&separator=%09&encapsulator=%1f&header=false&fieldnames=id,title,updated,xml,text&skip=updated,xml&stream.file=/wex/rawd/freebase-wex-2009-01-12-articles.tsv&stream.contentType=text/plain;charset=utf-8'
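
The import URL is easier to read with one parameter per line. The sketch below rebuilds the same URL piece by piece; the build_import_url function name is made up for illustration, and the comments are my reading of the CSV update handler's parameters rather than anything stated in the original post.

```shell
#!/bin/sh
# Rebuild the update/csv URL from the command above, one parameter at a time.
BASE='http://localhost:8080/wikipedia/update/csv'
WEX_FILE='/wex/rawd/freebase-wex-2009-01-12-articles.tsv'

build_import_url() {
  params="commit=false"                    # do not commit when the load finishes
  params="$params&overwrite=false"         # do not check for existing documents with the same id
  params="$params&separator=%09"           # fields are separated by tabs
  params="$params&encapsulator=%1f"        # a character unlikely to occur, so quotes stay literal
  params="$params&header=false"            # the first line is data, not field names
  params="$params&fieldnames=id,title,updated,xml,text"
  params="$params&skip=updated,xml"        # read these columns but do not index them
  params="$params&stream.file=$WEX_FILE"   # Solr reads the file from its own local disk
  params="$params&stream.contentType=text/plain;charset=utf-8"
  printf '%s?%s\n' "$BASE" "$params"
}

# Print the URL; to run the import, pass it to curl instead:
#   curl "$(build_import_url)"
build_import_url
```

Because stream.file makes Solr open the file itself, the path must be readable by the Tomcat user, and remote streaming presumably needs to be enabled in solrconfig.xml.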

Optimize the Lucene index (this also commits the documents imported with commit=false): curl 'http://localhost:8080/wikipedia/update?optimize=true'

Query Solr using the MoreLikeThis handler:

http://YOUR-PUBLIC-DNS.amazonaws.com:8080/wikipedia/mlt?rows=5&wt=json&stream.body=the+text+to+compare+to+the+index
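
From the command line, the same query can be sent with curl, which takes care of URL-encoding the text. This is a sketch only: the mlt_query function name is made up for illustration, and it assumes the /mlt handler is defined in the solrconfig.xml used above.

```shell
#!/bin/sh
# mlt_query HOST TEXT: ask the MoreLikeThis handler for the 5 documents
# most similar to TEXT. (Hypothetical helper name.)
mlt_query() {
  host="$1"
  text="$2"
  # -G appends the --data-urlencode pairs to the URL as a GET query string.
  curl -s -G "http://$host:8080/wikipedia/mlt" \
    --data-urlencode "rows=5" \
    --data-urlencode "wt=json" \
    --data-urlencode "stream.body=$text"
}

# Example:
#   mlt_query YOUR-PUBLIC-DNS.amazonaws.com 'the text to compare to the index'
```

For long passages of text, a GET URL may hit length limits, so a POST is probably the safer choice there.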

Optional: create a new dyndns.com zone and request http://${USER}:${PASS}@members.dyndns.org/nic/update?hostname=${HOST} on startup to set the IP address.
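
A sketch of the startup call, passing the credentials with curl's -u option instead of embedding them in the URL. The update_dyndns function name and the idea of calling it from /etc/rc.local are my assumptions, not part of the original setup.

```shell
#!/bin/sh
# update_dyndns USER PASS HOST: request a DNS update for HOST.
# (Hypothetical helper; the update URL is the one given above.)
update_dyndns() {
  dyn_user="$1"
  dyn_pass="$2"
  dyn_host="$3"
  # -u sends the credentials as HTTP basic auth.
  curl -s -u "$dyn_user:$dyn_pass" \
    "http://members.dyndns.org/nic/update?hostname=$dyn_host"
}

# Example, e.g. from /etc/rc.local:
#   update_dyndns myuser mypass myhost.dyndns.org
```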

Optional: create a new snapshot from the EBS data volume and share it so that anyone can use the Lucene index and Solr configuration with their own instance of Solr.