Archiving Timestamped Copies of Bookmarked Web Content

  1. I've been wanting to bookmark content that's behind paywalls (e.g. journal articles) for reading later in Instapaper, so last week I made a bookmarklet that posts the content of the current page to an App Engine app. The app stores the content, creates a new URL for it, then redirects to Instapaper to bookmark the new URL (which provides the full content) instead of the original access-controlled URL (which doesn't).

    Today, Mendeley announced their new web importer bookmarklet, which, as well as providing the standard bookmarking functions, uses the browser to post a full copy of the bookmarked page to Mendeley, which stores the snapshot for you. You can export the HTML snapshot, but it's just the raw HTML: there's no rewriting or caching of included files such as images or CSS, so they'll either be broken or loaded from the original server.
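
The server side of the bookmarklet flow in item 1 (store the posted page, create a new URL, redirect to the bookmarking service) could look roughly like the sketch below. It is only an illustration under stated assumptions: the in-memory `SNAPSHOTS` store, the `/snapshot/` path, the function names and the `url`/`title` query parameters are all hypothetical, not the actual App Engine app or Instapaper's interface.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical in-memory store; a real app would use a persistent datastore.
SNAPSHOTS = {}


def store_snapshot(original_url, html):
    """Store the posted page and return the path of its new copy."""
    digest = hashlib.sha256((original_url + "\n" + html).encode("utf-8"))
    key = digest.hexdigest()[:16]
    SNAPSHOTS[key] = {"source": original_url, "html": html}
    return "/snapshot/" + key


def bookmark_redirect(bookmark_endpoint, app_base, snapshot_path, title):
    """Build the URL to redirect the browser to once the snapshot is stored.

    bookmark_endpoint is an assumed "add bookmark" URL that accepts
    url and title query parameters.
    """
    query = urlencode({"url": app_base + snapshot_path, "title": title})
    return bookmark_endpoint + "?" + query
```

The key is derived from the original URL and the content, so posting the same page twice yields the same new URL, while a changed page gets a fresh one.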

  2. For a few years, I've been keeping a local copy of all my Delicious bookmarks, fetching the HTML of each bookmark when it's added so that it can be archived and searched.

    Recently, I signed up for the $25/year archive service from Maciej Ceglowski's Pinboard, which creates a proper archive (including images, etc.) of all your bookmarks as they're added (including continuous importing from Delicious). They're working on an export function, probably a zipped bundle of all the archived pages. Pinboard only archives the version of the page that it can see, though, not the version that you saw in your browser if the page is behind a paywall or authentication.
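
A local archive like the one in item 2 can be as simple as one timestamped file per fetch plus a brute-force text search. The sketch below assumes the HTML has already been fetched (for example with `urllib.request.urlopen`); the directory layout and the function names are made up for illustration and are not how my own archive or Pinboard's works.

```python
import hashlib
import time
from pathlib import Path


def archive_bookmark(url, html, root, when=None):
    """Write html under root, in a folder named by a hash of the URL,
    with the fetch time (UTC) as the file name."""
    when = time.gmtime() if when is None else when
    stamp = time.strftime("%Y%m%dT%H%M%SZ", when)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / (stamp + ".html")
    path.write_text(html, encoding="utf-8")
    return path


def search_archive(root, term):
    """Return archived pages whose text contains term, ignoring case."""
    term = term.lower()
    return sorted(
        p for p in Path(root).rglob("*.html")
        if term in p.read_text(encoding="utf-8").lower()
    )
```

Keeping every fetch as a separate file means that re-archiving a bookmark later never overwrites the earlier copy.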

  3. What I'd really like to see is an archive of each bookmark, as you saw it in your browser at the time, in WARC format, which preserves both the HTTP headers and the content of the response. That way, whenever you cite a web page, you can include a timestamped, cached version of the cited page, as you saw it.
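
To make the WARC idea concrete: a `response` record is a short block of WARC headers, a blank line, the raw HTTP response (status line, headers and body), then two CRLFs. The function below is a minimal hand-rolled sketch of a single WARC/1.0 record; the function name is hypothetical, and a real archive would more likely be written with an existing WARC library and would add further records and headers.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_response, when=None):
    """Wrap raw HTTP response bytes in one WARC/1.0 response record.

    when should be a timezone-aware datetime; it defaults to now (UTC).
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: " + when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: " + target_uri,
        "Content-Type: application/http; msgtype=response",
        "Content-Length: %d" % len(http_response),
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The record block is followed by two CRLFs.
    return head + http_response + b"\r\n\r\n"
```

The `WARC-Date` header is what gives the snapshot its timestamp, and because the HTTP headers are stored verbatim, the record also keeps details such as the content type and any caching headers the server sent.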

    The Internet Archive has a WARC renderer, but this would probably also need a browser extension that can read WARC files (or perhaps a different format: HTML with all the content inlined as data: URIs, for example, or MHTML) and display the archive as if the reader were seeing the original page.
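
As a rough illustration of the data: URI alternative, the sketch below replaces `src="..."` attributes with base64 data: URIs for any resource whose bytes are already in hand. The `resources` mapping and both function names are assumptions for the example, and the regular expression is a shortcut: a real tool would use an HTML parser and would also need to handle CSS, `srcset` and relative URLs.

```python
import base64
import re


def to_data_uri(content, mime_type):
    """Encode bytes as a base64 data: URI."""
    encoded = base64.b64encode(content).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)


def inline_images(html, resources):
    """Replace double-quoted src values that appear in resources.

    resources maps a URL, exactly as written in the HTML, to a
    (mime_type, bytes) pair. Unknown URLs are left untouched.
    """
    def swap(match):
        url = match.group(2)
        if url not in resources:
            return match.group(0)
        mime_type, content = resources[url]
        return match.group(1) + to_data_uri(content, mime_type) + match.group(3)

    return re.sub(r'(src=")([^"]*)(")', swap, html)
```

For example, `inline_images('<img src="logo.png">', {"logo.png": ("image/png", png_bytes)})` returns the same tag with the image embedded, so the page no longer depends on the original server.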

    There are some privacy concerns around publishing an exact snapshot of a page that's behind authentication, in case there are any private keys or other information stored somewhere in the page...