Making a Lucene index of Wikipedia for MoreLikeThis queries

See Chris Sizemore's post for why this is useful, Rattle Research's conText project, which takes this one step further, and the conText API, which provides an interface to their Lucene index of Wikipedia and also performs extra term disambiguation.

I chose to use Freebase's Wikipedia Extraction dump rather than the raw data from Wikipedia, as the articles have been processed from Wiki markup into plain text.

  1. Download the WEX articles file (TSV, 4.3GB compressed). The current version is freebase-wex-2008-11-06-articles.tsv.bz2
  2. Optionally run through the file once with fgetcsv/fputcsv to keep only the 'id', 'title' and 'text' fields, discarding 'updated' and 'article' (see PHP code below). This gets the filesize down to 1.5GB compressed.
  3. If using a 32-bit system that can't open the large uncompressed TSV file, split it into sections:
    split -l 100000 freebase-wex-2008-11-06-articles-min.tsv 'freebase-segment-'
  4. Set up Solr with a new document schema, 'wikipedia', containing 'id', 'title' and 'body' fields. Run through the TSV file and post each item to Solr (see the indexing sketch below). I chose to add documents in batches of 1000 and to commit changes and optimise the index after every 100000 documents, but this depends on the amount of memory and the number of open files available on your system.
  5. Make sure Solr's configuration for this index has a MoreLikeThisHandler defined (a minimal example is shown under 'MoreLikeThis queries' below).
  6. To use the index to find articles that are similar to a given chunk of text, try the code below.
  7. Once you have a list of similar articles, you can use the code below to find the categories that have been assigned to those articles in Wikipedia. The most common of these are likely to be applicable to your original text.

Stripping out unwanted information from the TSV file

<?php
// wget -c 'http://download.freebase.com/wex/freebase-wex-2008-11-06-articles.tsv.bz2'
// bunzip2 'freebase-wex-2008-11-06-articles.tsv.bz2'
$in = fopen('freebase-wex-2008-11-06-articles.tsv', 'r');
$out = fopen('/somewhere/freebase-wex-2008-11-06-articles-min.tsv', 'w'); // different drive, for speed
while (($data = fgetcsv($in, 0, "\t")) !== FALSE) {
  unset($data[2]); // 'updated'
  unset($data[3]); // 'article'
  fputcsv($out, $data, "\t"); // 'id', 'title', 'text'
}
fclose($in);
fclose($out);
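
Posting documents to Solr

This is a rough sketch of step 4 rather than a drop-in script: it assumes the 'wikipedia' core used in the MoreLikeThis query below accepts XML updates at /update, and the solr_post() helper is just for illustration. Adjust the URLs, field names and batch sizes to suit your setup.

<?php
// post an XML message (<add>, <commit/> or <optimize/>) to Solr's update handler
function solr_post($xml){
  $context = stream_context_create(array('http' => array(
    'method' => 'POST',
    'header' => "Content-Type: text/xml; charset=UTF-8\r\n",
    'content' => $xml,
    )));
  return file_get_contents('http://localhost:8080/wikipedia/update', false, $context);
}

$in = fopen('freebase-wex-2008-11-06-articles-min.tsv', 'r'); // the minimised TSV (or loop over the split segments)
$docs = array();
$count = 0;

while (($data = fgetcsv($in, 0, "\t")) !== FALSE) {
  list($id, $title, $text) = $data;

  // build one <doc> element per article, escaping XML special characters
  $docs[] = '<doc>'
    . '<field name="id">' . htmlspecialchars($id) . '</field>'
    . '<field name="title">' . htmlspecialchars($title) . '</field>'
    . '<field name="body">' . htmlspecialchars($text) . '</field>'
    . '</doc>';

  // add documents in batches of 1000
  if (count($docs) == 1000) {
    solr_post('<add>' . implode('', $docs) . '</add>');
    $docs = array();
  }

  // commit and optimise after every 100000 documents
  if (++$count % 100000 == 0) {
    solr_post('<commit/>');
    solr_post('<optimize/>');
  }
}

// flush the final batch, then commit and optimise one last time
if ($docs) solr_post('<add>' . implode('', $docs) . '</add>');
solr_post('<commit/>');
solr_post('<optimize/>');
fclose($in);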

MoreLikeThis queries
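
These queries need a MoreLikeThisHandler defined in the core's solrconfig.xml (step 5). A minimal entry, as a sketch assuming the handler is mapped to /mlt to match the URL used below:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>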

<?php
// $text is the chunk of text to find similar articles for
// (for long texts, send stream.body as a POST parameter rather than in the URL)
$params = array(
  'stream.body' => $text,
  'fl' => 'id,title',
  'rows' => 50,
  'start' => 0,
  'wt' => 'json',
  'mlt.fl' => 'title,body', // fields to use for similarity
  'mlt.mintf' => 1, // minimum term frequency
  'mlt.mindf' => 1, // minimum document frequency
  'mlt.maxqt' => 20, // maximum number of query terms
  'mlt.minwl' => 3, // minimum word length
  'mlt.boost' => 'true',
  'mlt.interestingTerms' => 'details',
  );
$data = json_decode(file_get_contents('http://localhost:8080/wikipedia/mlt?' . http_build_query($params)));
$similar = $data->response->docs;
$terms = $data->interestingTerms;

Wikipedia categories

<?php
// Count the Wikipedia categories assigned to a set of article titles.
// The MediaWiki API accepts up to 50 titles per request, so chunk larger lists.
function wikipedia_categories($titles){
  $params = array(
    'action' => 'query',
    'format' => 'json',
    'prop' => 'categories',
    'redirects' => 'true',
    'cllimit' => 500,
    'clshow' => '!hidden', // ignore hidden maintenance categories
    'titles' => implode('|', $titles),
    );

  $data = json_decode(file_get_contents('http://en.wikipedia.org/w/api.php?' . http_build_query($params)));

  $categories = array();
  foreach ($data->query->pages as $page){
    if (!isset($page->categories)) continue;
    foreach ($page->categories as $category){
      $title = $category->title;
      $categories[$title] = isset($categories[$title]) ? $categories[$title] + 1 : 1;
    }
  }
  arsort($categories); // most frequent categories first
  return $categories;
}
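
For example, to suggest categories for the original chunk of text, you might pass in the titles of the similar articles found earlier. This is a sketch assuming $similar from the MoreLikeThis code above; the 50 rows requested there fit within the API's 50-titles-per-request limit, so one call is enough.

<?php
$titles = array();
foreach ($similar as $doc)
  $titles[] = $doc->title;

$categories = wikipedia_categories($titles);
print_r(array_slice($categories, 0, 10, true)); // the ten most common categories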