Analysing 'science' bookmarks in Delicious

One thing that Delicious doesn't provide is a list of the top items tagged with a given tag, ordered by the number of times the URL was tagged with that tag. For example, you can't get a list of the top items tagged with 'science'.

Nor is it possible to fetch more than 2000 items (20 pages) tagged with 'science' from Delicious. It is possible, however, to fetch up to 20 pages for each combination of 'science' plus one other tag, and to repeat this recursively with each newly discovered tag until you've got a decent collection of pages.
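
A rough sketch of that recursive crawl might look like the following. It is only an illustration: `crawl_tag` and the `fetch_bookmarks` callable are hypothetical names, not part of any Delicious client, and the sketch assumes that `fetch_bookmarks(tags)` returns the (capped) list of bookmarks carrying all of the given tags, each as a dict with `url` and `tags` keys.

```python
from collections import deque


def crawl_tag(root_tag, fetch_bookmarks, max_tags=1000):
    """Collect bookmarks for root_tag by querying root_tag + each discovered tag.

    fetch_bookmarks is a hypothetical callable supplied by the caller; it is
    assumed to return at most a fixed number of bookmarks per tag combination.
    """
    seen_tags = {root_tag}
    queue = deque([None])  # None means "query the root tag on its own"
    bookmarks = {}         # url -> set of all tags seen on that url

    while queue:
        other = queue.popleft()
        tags = [root_tag] if other is None else [root_tag, other]
        for item in fetch_bookmarks(tags):
            bookmarks.setdefault(item["url"], set()).update(item["tags"])
            # Every new co-occurring tag becomes another combination to query.
            for tag in item["tags"]:
                if tag not in seen_tags and len(seen_tags) < max_tags:
                    seen_tags.add(tag)
                    queue.append(tag)
    return bookmarks
```

The `max_tags` cap is there only to stop the crawl from running indefinitely; in practice you would also want to pause between requests.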

Picking out all the unique bookmarks from those pages produces a list of around 100,000 'science' URLs. Keeping only those that were tagged as 'science' by a reasonable number of unique users (about 20) reduces that to around 10,000, which may still contain some duplicates at slightly different URLs. For each of those URLs you can use the Delicious API to gradually fetch the top tags assigned to it, with a count of how many users used each tag, and store that information in a database (13MB gzipped MySQL dump or 17MB of the original JSON files).
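
As a minimal sketch of the filtering and storage steps, assuming the crawled bookmarks are available as `(url, user, tags)` tuples: the function names and the table layout here are made up for illustration, and SQLite stands in for the MySQL database mentioned above so that the example is self-contained.

```python
import sqlite3
from collections import defaultdict


def popular_urls(posts, tag="science", min_users=20):
    """Return (url, user_count) pairs for URLs tagged by at least min_users users."""
    users_by_url = defaultdict(set)
    for url, user, tags in posts:
        if tag in tags:
            users_by_url[url].add(user)  # a set, so repeat posts count once
    counts = [(url, len(users)) for url, users in users_by_url.items()
              if len(users) >= min_users]
    # Most-tagged first; ties broken by URL so the order is stable.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


def store_tag_counts(conn, url, tag_counts):
    """Save a {tag: number_of_users} mapping for one URL (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_tags ("
        "url TEXT, tag TEXT, users INTEGER, PRIMARY KEY (url, tag))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO url_tags (url, tag, users) VALUES (?, ?, ?)",
        [(url, tag, users) for tag, users in tag_counts.items()],
    )
    conn.commit()
```

Sorting by user count also gives the ranked "top items for a tag" list that Delicious itself doesn't offer.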

You can also fetch the original contents of each URL (4GB zipped, ask if you want it) in order to store the textual content (extracted from the whole page, perhaps) in an index.
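
One simple way to pull the text out of a fetched page, sketched here with Python's standard `html.parser` module, is to collect every text node outside `script` and `style` elements. This is a whole-page extraction with no attempt to isolate the main article; `TextExtractor` and `extract_text` are illustrative names only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string can then be handed to whichever full-text index you prefer.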