Guardian + Lucene = Similar Articles + Categorisation

March 10, 2009

The Guardian's collection of articles is, in one respect, a large, manually categorised corpus of documents. I fetched the 13,000 articles categorised as 'Science', indexed them in Solr, and used that index to find the Guardian articles most similar to a given article, along with the categories those articles carry.
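As a rough sketch of how this could work (the post does not say which Solr feature was used; MoreLikeThis is assumed here), the Python below builds a query URL for a MoreLikeThis handler and tallies the categories of the documents it returns. The handler path `/mlt`, the field names `title`, `body` and `category`, and the function names are all hypothetical, and passing text via `stream.body` depends on how the Solr instance is configured.

```python
from collections import Counter
from urllib.parse import urlencode


def build_mlt_url(solr_base, text, fields=("title", "body"), rows=20):
    """Build a MoreLikeThis query URL for a piece of external text.

    Assumes a MoreLikeThis handler is registered at <solr_base>/mlt and
    that the index has the (hypothetical) fields named in `fields`.
    """
    params = {
        "stream.body": text,          # the text to find similar documents for
        "mlt.fl": ",".join(fields),   # fields used to judge similarity
        "mlt.mintf": 1,               # minimum term frequency in the source text
        "mlt.mindf": 1,               # minimum document frequency in the index
        "rows": rows,                 # how many similar documents to return
        "fl": "id,title,category,score",
        "wt": "json",
    }
    return solr_base.rstrip("/") + "/mlt?" + urlencode(params)


def tally_categories(docs, field="category"):
    """Count how often each category appears across the returned documents.

    `docs` is the list of document dicts from the parsed JSON response;
    the category field may hold a single string or a list of strings.
    """
    counts = Counter()
    for doc in docs:
        value = doc.get(field, [])
        if isinstance(value, str):
            value = [value]
        counts.update(value)
    return counts.most_common()
```

Fetching the URL, parsing the JSON, and passing the `docs` list from the response to `tally_categories` would give a ranked list of categories for the 20 most similar articles, which can then be read as suggested categories for the input article.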

For example, this Nature News article has these similar Guardian articles:

and the 20 most similar articles have these categories: