Clustering documents with CLUTO

After getting a local copy of the metadata for around 360,000 articles, I wanted to use some clustering/topic modelling to divide them into categories for browsing. CLUTO was suggested, and it worked pretty well (though it's only working from the document titles so far, so there isn't much opportunity for semantic analysis - just word matching).
  1. Export all the documents to a file with one document per line. From a MySQL table of document titles, this is as simple as running SELECT title FROM articles and exporting the result as CSV.
  2. Use doc2mat to convert the list of documents into a term matrix: doc2mat -nostem titles.csv titles.mat
  3. Run vcluster on the matrix to produce 1000 clusters, and ask it to suggest distinctive features and summaries of each cluster:
    vcluster -showfeatures -nfeatures=20 -showsummaries=cliques titles.mat 1000 > clusters.txt
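  4. Parse the clusters file (by default vcluster writes the assignments to titles.mat.clustering.1000, one cluster number per line, in the same order as the documents) and generate SQL statements for inserting the clusters back into the original database.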
  5. Copy the cluster features section of the vcluster output into a new file, and parse it to extract the clusters and their features (libcluto and a Perl library are available, but parsing with regular expressions was easy enough).
    This generates a set of SQL statements for a "clusters" table.
  6. Run the SQL statements to import the clusters data: mysql -u USER -p DATABASE < clusters.sql; mysql -u USER -p DATABASE < cluster-features.sql
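Step 4 could look something like the following Python sketch. It assumes that vcluster wrote its assignments to titles.mat.clustering.1000 (one cluster number per line, in document order), that titles.csv still has exactly one raw title per line, and that doc2mat kept every line as a document. The article_clusters table, its columns and the function names are made up for illustration.

```python
def sql_quote(value):
    # Escape backslashes and single quotes for a MySQL string literal.
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def cluster_inserts(titles, cluster_lines):
    """Pair the i-th title with the i-th line of the clustering file."""
    titles = list(titles)
    cluster_lines = list(cluster_lines)
    if len(titles) != len(cluster_lines):
        raise ValueError("titles and clustering file have different line counts")

    statements = []
    for title, line in zip(titles, cluster_lines):
        # The cluster number is the first field on the line.
        cluster = int(line.split()[0])
        if cluster < 0:
            # A negative number means the document was left unclustered.
            continue
        statements.append(
            # Hypothetical table and column names.
            "INSERT INTO article_clusters (title, cluster) "
            f"VALUES ({sql_quote(title)}, {cluster});"
        )
    return statements


def write_cluster_sql(titles_path, clustering_path, out_path):
    # Read both files line by line, stripping only the trailing newline.
    with open(titles_path, encoding="utf-8") as f:
        titles = [line.rstrip("\n") for line in f]
    with open(clustering_path, encoding="utf-8") as f:
        cluster_lines = [line.rstrip("\n") for line in f]
    with open(out_path, "w", encoding="utf-8") as out:
        for statement in cluster_inserts(titles, cluster_lines):
            out.write(statement + "\n")
```

Calling write_cluster_sql("titles.csv", "titles.mat.clustering.1000", "clusters.sql") would then produce the file imported in step 6. Matching on title is fragile if titles repeat; exporting an ID column alongside the title and keying on that would be safer.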
Problems with this method:
  • Each document only gets assigned to one cluster, whereas ideally they could be placed in multiple categories with scores for each category.
  • There's no immediate way to add new documents to existing clusters, without using a separate tool.
  • CLUTO isn't open source.