I started writing a Drupal module that pulls in the RSS feed of my bookmarks from del.icio.us, stores each bookmark as a Drupal node, stores a cached version of each page, and posts it to Solr for searching.
It's working OK so far, disk space permitting. The project does, though, illustrate a number of problems:
- Web pages disappear: obviously, but more often than you might think. Luckily I have the timestamp for each bookmark, so I should be able to use the Internet Archive to get a version of the page for indexing. Even so, a lot of bookmarked pages can no longer be linked to at all, which is a shame (but unavoidable, basically).
- When the server fetches a copy of the page for caching, it doesn't necessarily see what I see in a web browser, particularly on a site where I'm logged in. [Zotero solves this problem nicely by running inside the browser and creating snapshots client-side. I also use Slogger to make a local copy of all the pages I bookmark.]
- There's no standard format for creating a packaged archive of a web page. There are .webarchive files for WebKit, MHTML (.mht) files for Internet Explorer and Opera, Mozilla Archive Format (.maff) for Mozilla, WAR (tar.gz) files for Konqueror, PDF, etc., but no standard way of creating a zip file with all the HTML, scripts, stylesheets and media needed to recreate a page in full at a later date. iCab apparently saves 'portable web archives' as zip bundles of HTML and associated files. This is probably something that the WHATWG should be working on (and it has started to, at least in a requirements document).
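
To make the last point concrete, here is a rough sketch in Python of the kind of ad hoc zip bundle described above. It is not any browser's actual format: the `bundle_page` function, the `index.html` name and the `assets/` folder are all made up for illustration, and it assumes the HTML and its assets have already been downloaded.

```python
import zipfile


def bundle_page(html: str, assets: dict[str, bytes], out_path: str) -> list[str]:
    """Write a page and its already-downloaded assets into one zip file.

    Hypothetical layout (not a standard): the page goes in index.html
    and every asset goes under assets/, keyed by a relative path.
    """
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The page itself; a str is stored UTF-8 encoded.
        zf.writestr("index.html", html)
        # Scripts, stylesheets and media, in a stable (sorted) order.
        for rel_path, data in sorted(assets.items()):
            zf.writestr(f"assets/{rel_path}", data)
        return zf.namelist()
```

Note what the sketch leaves out: the links inside the HTML would still need rewriting to point at the bundled copies, and nothing records the original URL or fetch date. Those gaps are exactly what a standard format would have to pin down.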
Update: theinfo points to WARC, a "generalization of the ARC format used by the Internet Archive".
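
On the first problem in the list, the Internet Archive lookup could be sketched roughly like this in Python. It relies on the Wayback Machine's URL pattern of a 14-digit timestamp followed by the original address, which as far as I know redirects to the nearest snapshot it holds. The `wayback_url` function is a hypothetical helper, and the assumption that each bookmark's timestamp is available as a timezone-aware datetime is mine.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, bookmarked_at: datetime) -> str:
    """Build a Wayback Machine URL for the snapshot nearest a bookmark's time.

    Assumes bookmarked_at is timezone-aware; it is converted to UTC and
    formatted as YYYYMMDDhhmmss, the timestamp form used in Wayback URLs.
    """
    stamp = bookmarked_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{page_url}"


# Example with a made-up bookmark:
saved = datetime(2009, 3, 1, 12, 30, 0, tzinfo=timezone.utc)
print(wayback_url("http://example.com/page", saved))
```

Whether a snapshot exists near that date is another matter; the caching code would still have to handle pages the Archive never captured.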