Topic Modelling with MALLET

MALLET is topic modelling software produced by Andrew McCallum's group at the University of Massachusetts Amherst. It is open source and written in Java, runs from the command line, and is reasonably usable and well documented.

  1. First, you need a set of text documents to create the topic model.
  2. export MALLET="/path/to/mallet/bin/mallet" (the path to wherever you've put the MALLET executable)
  3. export MODEL="your_model_name" (this can be whatever you like)
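  4. Create a single file ($MODEL-mallet.txt) that has one document per line, in the format 'IDENTIFIER LANGUAGE TEXT', e.g. '12345 en This is the text. Some more text.'. With its default settings, MALLET reads the first field as the document name, the second as its label and the rest of the line as the text.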
  5. Convert that input file to MALLET's input format:
    "$MALLET" import-file --input "$MODEL-mallet.txt" --output "$MODEL.mallet" --keep-sequence --remove-stopwords
  6. Run MALLET to build a topic model:
    "$MALLET" train-topics --input "$MODEL.mallet" --output-model "$MODEL-topic-model.gz" --num-threads 2 \
    --num-topics 100 --num-iterations 1000 --doc-topics-threshold 0.1 --optimize-interval 10 --num-top-words 100 \
    --word-topic-counts-file "$MODEL-word-topic-counts.txt" --output-doc-topics "$MODEL-doc-topics.txt" --output-topic-keys "$MODEL-topic-keys.txt"
  7. Parse the output files: $MODEL-doc-topics.txt lists, for each input document, the topics assigned to it (only those whose proportion is at least --doc-topics-threshold); $MODEL-topic-keys.txt lists the top keywords for each topic (up to --num-top-words).
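
Steps 4 and 7 are the only ones that don't call MALLET itself, so here is a rough sketch of both in plain shell. The function names (build_mallet_input, top_keywords) are made up for illustration. The first assumes your documents are separate .txt files in one directory, with filenames that contain no spaces. The second assumes $MODEL-topic-keys.txt has three tab-separated columns (topic number, weight, space-separated keywords). I've left $MODEL-doc-topics.txt out because, as far as I know, its column layout differs between MALLET versions.

```shell
# Hypothetical helper for step 4: write one line per .txt file in DIR,
# in the form 'IDENTIFIER LANGUAGE TEXT'. Write OUT somewhere outside DIR
# (or without a .txt extension) so it is not picked up as a document.
build_mallet_input() {
    dir="$1"; lang="$2"; out="$3"
    : > "$out"                              # create or empty the output file
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue             # skip if the glob matched nothing
        id=$(basename "$f" .txt)            # filename without .txt is the identifier
        # Turn newlines, tabs and runs of spaces into single spaces, then trim
        # the ends, so the whole document fits on one line.
        text=$(tr -s '[:space:]' ' ' < "$f" | sed 's/^ //; s/ $//')
        printf '%s %s %s\n' "$id" "$lang" "$text" >> "$out"
    done
}

# Hypothetical helper for step 7: print each topic number followed by its
# first N keywords, assuming tab-separated columns: topic, weight, keywords.
top_keywords() {
    awk -F '\t' -v n="$2" '{
        split($3, w, " ")                   # keywords are space-separated
        line = $1 ":"
        for (i = 1; i <= n && (i in w); i++) line = line " " w[i]
        print line
    }' "$1"
}
```

For example, build_mallet_input ./docs en "$MODEL-mallet.txt" would cover step 4 for English documents in ./docs, and top_keywords "$MODEL-topic-keys.txt" 10 would give a quick overview of each topic after step 6.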

The most important parameter in the modelling step is --num-topics, which sets the number of topics that MALLET fits to the documents. Choose it by hand, based on the size of the collection and a rough idea of how fine-grained you want the topics to be.
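
Since the number of topics is chosen by hand, one way to settle on a value is to train a few models and compare their keyword lists by eye. This is only a sketch: sweep_num_topics is a made-up name, it reuses a subset of the options from step 6, and it assumes $MALLET and $MODEL are set and $MODEL.mallet already exists.

```shell
# Hypothetical helper: train one model per topic count given as an argument,
# writing each run's keywords and document-topic assignments to its own files.
# Assumes $MALLET and $MODEL are set (steps 2 and 3) and that
# $MODEL.mallet has been created (step 5).
sweep_num_topics() {
    for k in "$@"; do
        "$MALLET" train-topics --input "$MODEL.mallet" \
            --num-topics "$k" --num-iterations 1000 --optimize-interval 10 \
            --output-topic-keys "$MODEL-$k-topic-keys.txt" \
            --output-doc-topics "$MODEL-$k-doc-topics.txt"
    done
}
```

Calling sweep_num_topics 25 50 100 would then leave three sets of output files to compare. Each run takes as long as a normal training run, so keep the list of values short.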

Using --keep-sequence-bigrams when importing and --use-ngrams true when modelling should include bigrams in the selection of topic keywords, but I have not yet managed to run that combination without running out of memory.