Guardian + Lucene = Similar Articles + Categorisation

The Guardian's collection of articles is, in one respect, a large, manually categorised corpus of documents. I fetched the 13,000 articles categorised as 'Science', indexed them in Solr, and used the index to find the Guardian articles most similar to a given piece of text, along with the categories those articles carry.
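As a rough illustration of the lookup step, here is a minimal sketch that builds a request URL for Solr's MoreLikeThis handler. It is an assumption that a handler like this was used; the post only says the articles were fed to Solr. The core name `articles`, the field names `body`, `title` and `category`, and the helper `build_mlt_url` are all hypothetical, and `stream.body` may need to be enabled explicitly in newer Solr versions.

```python
from urllib.parse import urlencode


def build_mlt_url(text, base="http://localhost:8983/solr/articles/mlt", rows=20):
    """Build a MoreLikeThis request URL for a piece of external text.

    The base URL and all field names are placeholders, not the
    configuration actually used for the Guardian index.
    """
    params = {
        "stream.body": text,             # text to find similar articles for
        "mlt.fl": "body",                # field(s) used to judge similarity
        "mlt.mintf": 1,                  # minimum term frequency in the source text
        "mlt.mindf": 1,                  # minimum document frequency in the index
        "fl": "title,category,score",    # fields to return for each match
        "rows": rows,                    # number of similar articles to return
        "wt": "json",
    }
    return base + "?" + urlencode(params)


url = build_mlt_url("Physicists hunt for dark matter")
print(url)
```

Only the URL is constructed here; sending it to a running Solr instance and parsing the JSON response is left out.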

For example, this Nature News article has these similar Guardian articles:

  1. Timeline of the universe
  2. Beyond the Standard Model
  3. Creation in the blink of an eye
  4. Dark side of creation
  5. Super-massive Q&As
  6. A ghostly halo that could unlock the dark secret of the universe
  7. The day after the big bang
  8. String fellows
  9. String theory: Is it science's ultimate dead end?
  10. Why one is still the loneliest number

and the 20 most similar articles have these categories:

  1. Space exploration [9]
  2. Astronomy [8]
  3. UK news [8]
  4. Physics [7]
  5. Education [7]
  6. Higher education [7]
  7. Research [6]
  8. Particle physics [4]
  9. World news [3]
  10. News [3]
  11. Technology [3]
  12. Cern [2]
  13. Observer [2]
  14. Stephen Hawking [1]
  15. Paul Davies [1]
  16. Controversies in science [1]
  17. People in science [1]
  18. Science and nature [1]
  19. Books [1]
  20. Mathematics [1]

Comments

How did you return the category counts? Are they in the same solr/lucene return as the articles or are they calculated from a separate query?

The categories are stored in Lucene alongside each article, in a multivalued field. They come back in the same result set as the articles and are then counted across that set.
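A minimal sketch of that counting step, assuming each result is a dict whose `category` key holds the list of values from the multivalued field. The field name and the `tally_categories` helper are made up for illustration.

```python
from collections import Counter


def tally_categories(docs, field="category"):
    """Count how many result documents carry each category.

    `docs` is a list of dicts, one per similar article; `field` names
    the (hypothetical) multivalued category field.
    """
    counts = Counter()
    for doc in docs:
        # set() guards against a category being listed twice on one article
        counts.update(set(doc.get(field, [])))
    # Highest counts first
    return counts.most_common()


results = [
    {"title": "Timeline of the universe", "category": ["Astronomy", "Physics"]},
    {"title": "String fellows", "category": ["Physics"]},
    {"title": "Untagged piece"},
]
print(tally_categories(results))
```

Run over the 20 most similar articles, this kind of tally gives a ranked list of categories with a count beside each one.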
