Identifying papers with content hashes


James Howison presented his paper "Why can't I manage papers like I manage MP3s". I have one thought about this, which the paper [PDF] already addresses pretty well:

Basically, you need to be able to identify the paper even after the reader has added comments and markup. You can't use a hash of the whole file, obviously, as the file will have changed. One option might have been something akin to the audioSHA1 hash, which strips off the ID3 tags and hashes the naked audio data; the equivalent here would be stripping out the XMP markup (comments, etc.) before hashing. But Acrobat seems to re-optimise PDFs when you resave them, so the rest of the file will have changed as well. Identifying the paper therefore requires some kind of fingerprinting, similar to the TRM that MusicBrainz uses: extracting the text from the PDF and using a combination of unique phrases, keyword frequencies, etc., to produce a unique description of that paper.
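To make the fingerprinting idea a bit more concrete, here is a minimal sketch in Python. It is not the TRM algorithm and not what the paper proposes; it swaps in a simple technique (hashing overlapping word sequences, or "shingles", and keeping the smallest hashes) as a stand-in for "unique phrases". It assumes the text has already been extracted from the PDF, and the function names and the `shingle_size` and `sketch_size` parameters are made up for illustration.

```python
import hashlib
import re


def normalise(text):
    """Lowercase the text and reduce it to a list of alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text, shingle_size=4, sketch_size=32):
    """Return the sketch_size smallest hashes of the text's word shingles.

    Hypothetical stand-in for a real fingerprint: layout, case and
    punctuation are ignored, so the same words give the same result.
    """
    words = normalise(text)
    # Overlapping runs of shingle_size consecutive words.
    shingles = {
        " ".join(words[i:i + shingle_size])
        for i in range(max(1, len(words) - shingle_size + 1))
    }
    hashes = sorted(
        int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16) for s in shingles
    )
    return frozenset(hashes[:sketch_size])


def similarity(a, b):
    """Rough overlap of two fingerprints, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two copies of a paper whose extracted text differs only in whitespace, case or punctuation would get identical fingerprints, while slightly edited versions would share most of their hashes. Matching would then be a matter of comparing `similarity` against some threshold rather than testing for exact equality.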

This ID could then be used to look up the metadata from a central server. The benefit is that it would work for papers in other formats too: XML, RTF or HTML. Note that, because the ID would match versions of a paper that have changed slightly or are in different formats, it would not be suitable for fetching a file from a distribution network, as it could not guarantee the fidelity of the file - a hash of the whole original PDF would be required for that.
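For contrast, the whole-file hash needed for fetching is the easy part. A minimal sketch in Python, assuming SHA-1 (chosen only to mirror audioSHA1; the hash function and both function names are my own assumptions, not anything specified in the paper):

```python
import hashlib


def whole_file_hash(data: bytes) -> str:
    """Hex SHA-1 digest of the complete file contents."""
    return hashlib.sha1(data).hexdigest()


def verify_download(data: bytes, expected_hash: str) -> bool:
    """True only if the downloaded bytes are exactly the original file."""
    return whole_file_hash(data) == expected_hash
```

Any change to the bytes, whether an added comment or a re-optimised resave, produces a different digest. That is exactly why this hash guarantees fidelity when fetching a file but is useless for recognising an annotated copy of the same paper.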