I started making a Drupal module that would pull in the RSS feed of my bookmarks from del.icio.us, store each bookmark as a Drupal node, store a cached version of each page and post it to Solr for searching.
It's working OK so far, disk space permitting. This project, though, illustrates a number of problems:
- Web pages disappear - obviously, but more often than you might think. Luckily I have the timestamp for each bookmark and should be able to use the Internet Archive to get a version of the page for indexing. Even so, there are a lot of bookmarked pages that it's a shame (though basically unavoidable) not to be able to link to any more.
- When the server fetches a copy of the page for caching, it doesn't necessarily see the same thing I see in a web browser, particularly if it's a site where I'm logged in. [Zotero solves this problem nicely by running inside the browser and creating snapshots client-side. I also use Slogger to make a local copy of all the pages I bookmark.]
- There's no standard format for creating a packaged archive of a web page. There are .webarchive files for WebKit, MHTML (.mht) files for Internet Explorer and Opera, Mozilla Archive Format (.maf) for Mozilla, WAR (tar.gz) files for Konqueror, PDF, etc., but no standard way of creating a zip file with all the HTML, scripts, stylesheets and media needed to recreate a page in full at a later date. iCab apparently saves 'portable web archives' as zip bundles of HTML and associated files. This is probably something that WHATWG should be working on (and has started to, at least in a requirements document).
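As a rough sketch of the Internet Archive lookup mentioned in the first point: the Wayback Machine has an availability endpoint that takes a URL and a timestamp and reports the closest snapshot. The function names here are my own, and the sketch only builds the query and reads an already-decoded JSON response; the actual HTTP fetch and error handling are left out.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Wayback Machine availability endpoint (returns JSON).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_timestamp(bookmarked_at: datetime) -> str:
    """Format a timezone-aware bookmark time as YYYYMMDDhhmmss in UTC."""
    return bookmarked_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def availability_query(page_url: str, bookmarked_at: datetime) -> str:
    """Build the lookup URL for the snapshot closest to the bookmark time."""
    params = {"url": page_url, "timestamp": wayback_timestamp(bookmarked_at)}
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot_url(response: dict):
    """Pull the snapshot URL out of a decoded response, or None if there isn't one."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

The URL returned by `closest_snapshot_url` could then be fetched and indexed in place of the dead page, bearing in mind that the closest snapshot may be from well before or after the bookmark was made.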
Update: theinfo points to WARC, a "generalization of the ARC format used by the Internet Archive".
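To make the "zip file with everything in it" idea concrete, here is a minimal sketch of what such a bundle could look like. The layout (`manifest.json`, `index.html`, a `resources/` folder) is entirely made up for illustration; it is not WARC, not iCab's format, and not any existing standard. It also assumes the HTML and its resources have already been fetched and that links in the HTML have been rewritten to the relative paths.

```python
import io
import json
import zipfile


def bundle_page(page_url, fetched_at, html, resources):
    """Pack a page into an in-memory zip and return its bytes.

    `resources` maps a relative path (e.g. "style.css") to that file's bytes.
    The archive layout is hypothetical, not a standard.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Record where and when the page came from, plus what was saved with it.
        manifest = {
            "url": page_url,
            "fetched_at": fetched_at,
            "files": sorted(resources),
        }
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("index.html", html)
        for path, data in resources.items():
            archive.writestr("resources/" + path, data)
    return buffer.getvalue()
```

Keeping the source URL and fetch time inside the bundle matters because, once the original page disappears, the archive is the only record of where the content came from.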
Comments

I like this idea, it's been something on my mind for a long time.
For the packaged archive, I quite like the "HTML complete" option for saving a webpage in Camino, which I think is the same in Firefox. I think it does a pretty good job of saving the webpage with the proper link references to a local media folder with all the .css, image and other bits. I haven't looked to see if it keeps all the functional .js files etc. I usually save a page that way and zip that up with the local folder associated with the .html file.
The thing is, a page is not a dead end: the content probably links to other pages, and one has to decide how deep to go to maintain the integrity of the content. That's hard to turn into a rule in code because it can be very subjective.
where's the code?
tnx
I'll post some code shortly, when it's working a bit better.
Nice idea. You've come up against some sticky problems, though, with saving the HTML. One other thing that I do before saving in Firefox Scrapbook (similar to Slogger) or Zotero is modify the layout of the page via Aardvark (http://karmatics.com/aardvark/) to remove cluttering elements. Or I'll use something like the repagination plugin (https://addons.mozilla.org/en-US/firefox/addon/2099) to combine a multi-part article into one HTML page (where something like a "print version" is not supplied).
All of those things happen on the client side, which makes it difficult to get them into the Drupal database. It sounds like a plugin might be required to get the client-side representation up to the server...