Creating a citable archive of a web page

Academic papers or weblog posts often need to refer to external web pages; generally, you want people to see the external pages as they were when you wrote about them.

The simplest way to do this is a standard hyperlink, combined with a quote of the appropriate section of the text. If you're referencing long pages though, lots of lengthy quotes could get out of hand.

You could save the external web page and host a copy of it locally, but this is troublesome and could be unreliable (as URLs change over time).

You could point to the Wayback Machine's (archive.org) copy of the page, which Simpy does for stored bookmarks, but you can't guarantee that archive.org will have a copy of the page from when you looked at it, as there's no way to trigger an import into the archive. You could also use Google's cache, but that only stores the most recently crawled version of the page.

You could use a bookmarking service such as Furl or Yahoo's My Web, which store a cached version of the web page when you bookmark it. This is a good solution, but it only allows you to store one cached version of each page, so if you bookmark the same page again in the future it will overwrite the original cache (though adding a random string, or the date, to the end of the URL would perhaps be one way to get around this limitation).
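As a rough illustration of the date-suffix workaround, here is a minimal Python sketch that appends the current date to a URL as an extra query parameter, so that each bookmark of the same page gets a distinct address. The function name `dated_url` and the parameter name `archived` are made up for this example, and it assumes both that the target site ignores query parameters it doesn't recognise and that the bookmarking service treats the result as a separate page; neither is guaranteed.

```python
from datetime import date
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dated_url(url, when, param="archived"):
    """Return `url` with an extra date query parameter appended.

    `param` is a hypothetical name; any key the target site ignores would do.
    """
    parts = urlsplit(url)
    # Keep the existing query parameters and add the date at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, when.isoformat()))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(dated_url("http://example.org/page?id=7", date(2005, 11, 16)))
# prints http://example.org/page?id=7&archived=2005-11-16
```

Using a query parameter rather than tacking text onto the path keeps the original page reachable on most servers, though some sites will redirect or return an error for unexpected parameters.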

Or, you could use a service that some colleagues of mine have produced called WebCite. Triggering the import of a web page via a bookmarklet stores a dated copy of that page in the archive (if that's allowed by the original site), which can then be referred to - indefinitely - by a unique URL. Another way to import referenced web pages is by uploading a paper that will be automatically parsed for hyperlinks: this is a model that could be used to provide ongoing support for the archive, as member publishers use it to cache web pages referred to by their articles.
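To give an idea of what "automatically parsed for hyperlinks" might involve, here is a small Python sketch that pulls the distinct http and https links out of a paper supplied as HTML. This is not WebCite's actual parser, which I haven't seen; the names `LinkCollector` and `extract_links` are invented for the example, and the step of submitting each link to the archive is left out because it depends on an interface that isn't described here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the distinct http(s) targets of <a href="..."> tags, in order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip mailto:, relative and empty links; keep only web URLs.
            if name == "href" and value and value.startswith(("http://", "https://")):
                if value not in self.links:
                    self.links.append(value)


def extract_links(html_text):
    """Return the list of unique web links found in an HTML document."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


paper = '<p>See <a href="http://example.org/a">this page</a> and <a href="mailto:x@example.org">mail me</a>.</p>'
print(extract_links(paper))
# prints ['http://example.org/a']
```

A paper submitted as PDF or a word-processor file would need a different extraction step, but the idea of building a de-duplicated list of cited URLs stays the same.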

Obviously there will have to be limits on how much anyone can use the WebCite archive - i.e. not for making a backup of every page on their site every day - but that will probably depend on the patterns of usage.

Comments

I tried to do something like this here, with the added feature that you could link to an individual line. The goal was to improve the quality of online debate.

I think this is a terrible idea, because the web is a dynamic medium by its very nature.

Hyperlinks to people's webpages should point to their current webpages, not to any sort of old cached version, and to do otherwise is to breach the trust the original author of the page has extended by allowing linkage in the first place. If they want it to be static and permanent, they will make arrangements to that end. If they want it to be merely a scratch-pad, you shouldn't try to treat it as permanent.

The things which need to be static, such as peer-reviewed academic publications, have already solved the problem of permanency. The articles are printed and archived on the publisher's website, where they are expected to be, in perpetuity.

If you did link to a static version, you would have to mention (in good faith) that you are linking to an old version of a page, and doing that every time would be almost as unwieldy as just including the snippet itself.

Unless part of what you're trying to say is that so-and-so said this on this date, you're much better off linking to the current version of the page.

I understand that you would want to refer to something static, so you won't look stupid when your reference points to something no longer informative, or worse, has been changed to contain contradictory rather than supportive information, but that just means that you shouldn't be linking to ephemeral webpages in any serious work of scholarship, anyways.

There are two spheres of content. One is static and permanent, and one is ephemeral and dynamic, and you shouldn't try to convert one into the other.

Posted by: Grady on November 16, 2005 3:21 PM

Grady, I don't really agree with what you're saying (see the trackback from Lost Boy above for some links to more detailed studies and reasoning why this is needed, particularly for academic writing), but I do agree that the main link used should be to the original URL, and that the cached version should be an extra link (or perhaps added as an onclick handler), so as not to break the standard web linking mechanisms.

The concern is that future readers will be misled, perhaps intentionally or perhaps not, into thinking that a cached version represents the current views of the original author, and the original author may have no way of knowing this is happening.

I agree that 48% unretrievable citations is a serious problem, but I think it comes about from a misunderstanding; perhaps people wanting the web to be something it's not. It's certainly not, in its current implementation, an authoritative storehouse of all content ever made available, but rather a combination of permanent and ephemeral content. You know this, of course, but I think the distinction needs to be made explicitly, if only to keep people mindful of the fact that webpages, as opposed to published articles, can be easily modified or defaced in a way that is transparent to future readers.

In other words, there's no such thing as a peer-reviewed web page.

Posted by: Grady on November 16, 2005 3:55 PM

Grady - in every web page caching system I've seen (Google, Archive.org, WebCite, etc), there's a large, clear banner at the top of the page that states that this is a cached version of the page, along with the date at which the cached version was stored.

I don't think anyone's likely to be seriously misled into misunderstanding what's going on - they should also be able to realise that following the link presented will show them the most current version of the page, if it still exists.

Of course, if you don't want people to be taking snapshots of your web pages, you shouldn't be putting them online in the first place (or you should at least be excluding caching in your robots.txt file).

The larger issue for me is that a system for reputation and believability for any old webpage or blog post hasn't been worked out, so to me they all default to hearsay, equivalent to the statements you find in publications cited as (W. Gunn, personal communication). So to use some statement from some guy's webpage in a serious work of scholarship just sounds wrong, but perhaps you were talking about using the page as data itself, rather than the comments posted therein.

Posted by: Grady on November 16, 2005 4:26 PM

Alf, this is a great idea and you and your colleagues seem to have executed it very well :-) Are you going to support a REST API, and perhaps make the archive available via OAI-PMH? If you are looking for any volunteers I'd be interested in helping out.

Ed - the site isn't my work at all, but I'll certainly pass along your questions to those responsible.
