Extracting keyphrases from documents using MeSH terms and KEA

KEA extracts keyphrases from a set of documents; here the training keyphrases are MeSH terms. The KEA README covers most of the steps below.
  1. Create a folder called 'train' and, for each document in the training set, create a file with extension ".txt" containing the text of the document and a file with extension ".key" containing the known MeSH terms for this document (one per line).
  2. Create a folder called 'test' and, for each document in the test set, create a file with extension ".txt" containing the text of the document.
  3. Download and extract KEA. Fetch meshdata.rdf (a SKOS representation of the MeSH hierarchy) and put it in KEA's VOCABULARIES directory.
  4. From within the downloaded KEA folder, set up some environment variables:
    export KEAHOME=`pwd`
    export CLASSPATH=$CLASSPATH:$KEAHOME:$KEAHOME/lib/commons-logging.jar:$KEAHOME/lib/icu4j_3_4.jar:$KEAHOME/lib/iri.jar
    export CLASSPATH=$CLASSPATH:$KEAHOME/lib/jena.jar:$KEAHOME/lib/snowball.jar:$KEAHOME/lib/weka.jar:$KEAHOME/lib/xercesImpl.jar:$KEAHOME/lib/kea-5.0.jar
  5. Build the model:
    java -Xmx512M kea.main.KEAModelBuilder -l /path/to/training/folder -m articles -v meshdata -f skos -t NoStemmer
  6. Run KEA against the test set, using the model built above:
    java -Xmx512M kea.main.KEAKeyphraseExtractor -l /path/to/test/folder -m articles -v meshdata -f skos -t NoStemmer -n 10
  7. There should now be a set of ".key" files in the test folder, one per test document, containing the extracted keyphrases.
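
Before building the model, it can help to confirm that every ".txt" file in the training folder has a matching, non-empty ".key" file, as step 1 requires. The following is a minimal sketch of such a check; check_pairs is a hypothetical helper written for this note and is not part of KEA.

```shell
#!/bin/sh
# Sketch: report training documents that lack a usable ".key" file.
# check_pairs is a hypothetical helper, not something KEA provides.
check_pairs() {
    dir=$1
    missing=0
    for txt in "$dir"/*.txt; do
        # Skip the unexpanded pattern when the folder has no .txt files.
        [ -e "$txt" ] || continue
        key="${txt%.txt}.key"
        # -s is true only if the file exists and is not empty.
        if [ ! -s "$key" ]; then
            echo "missing or empty: $key"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every document has a non-empty .key file.
    [ "$missing" -eq 0 ]
}
```

Calling check_pairs /path/to/training/folder prints one line per document without usable MeSH terms and returns a non-zero status if any are found.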

In theory that should be enough, but I'm getting an error when KEA reads in the SKOS vocabulary. For now it at least seems to work with -v none, which doesn't use the vocabulary.