Guardian + Lucene = Similar Articles + Categorisation


The Guardian's archive of articles is, in effect, a large, manually categorised corpus of documents. I fetched the 13,000 articles categorised as 'Science', indexed them in Solr, and used the index to find, for any given article, the most similar Guardian articles and the categories assigned to them.
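
As a rough illustration of the lookup step, here is a minimal sketch that builds a request to Solr's MoreLikeThis handler for a piece of free text. This is one possible approach and not necessarily the one used here. It assumes a handler registered at `/mlt`, a hypothetical core URL, a hypothetical text field called `body`, and a Solr configuration that accepts `stream.body`.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, text, fields=("body",), rows=10):
    """Build a URL asking Solr's MoreLikeThis handler for documents similar to `text`.

    Assumptions (not from the original post): the handler is mounted at /mlt,
    the indexed text lives in the fields named in `fields`, and the server
    allows the query text to be passed via `stream.body`.
    """
    params = {
        "stream.body": text,            # the text to find similar documents for
        "mlt.fl": ",".join(fields),     # fields used to compute similarity
        "mlt.mintf": 1,                 # minimum term frequency in the source text
        "mlt.mindf": 1,                 # minimum document frequency in the index
        "mlt.interestingTerms": "list", # also return the terms that drove the match
        "rows": rows,                   # how many similar documents to return
        "wt": "json",                   # response format
    }
    return f"{base_url}/mlt?{urlencode(params)}"


# Hypothetical core name; adjust to the local Solr setup.
url = build_mlt_url("http://localhost:8983/solr/guardian", "dark matter halo", rows=20)
```

Fetching that URL and reading the `response.docs` array of the JSON reply would give the ranked similar articles.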

For example, this Nature News article has these similar Guardian articles:

  1. Timeline of the universe
  2. Beyond the Standard Model
  3. Creation in the blink of an eye
  4. Dark side of creation
  5. Super-massive Q&As
  6. A ghostly halo that could unlock the dark secret of the universe
  7. The day after the big bang
  8. String fellows
  9. String theory: Is it science's ultimate dead end?
  10. Why one is still the loneliest number

and the 20 most similar articles have these categories (the number of articles in each category is in brackets):

  1. Space exploration [9]
  2. Astronomy [8]
  3. UK news [8]
  4. Physics [7]
  5. Education [7]
  6. Higher education [7]
  7. Research [6]
  8. Particle physics [4]
  9. World news [3]
  10. News [3]
  11. Technology [3]
  12. Cern [2]
  13. Observer [2]
  14. Stephen Hawking [1]
  15. Paul Davies [1]
  16. Controversies in science [1]
  17. People in science [1]
  18. Science and nature [1]
  19. Books [1]
  20. Mathematics [1]