Extracting microcontent (XSLT, GRDDL, RDF)


Here's my favourite method of extracting microcontent from (X)HTML so far:

First of all, add the following to your userContent.css, so that known microformats will show up in the browser:

 -moz-outline: 2px solid invert !important;

Second, install this Greasemonkey script.

This script adds an item to Firefox's menu ('Extract Microcontent', under Tools > User Script Commands) which, when selected, will look in the current page for a profile link (GRDDL uses the profile attribute of the head element for this).

If it finds one, it will fetch the profile page and check it for a 'rel=profileTransformation' link. If that exists, it will fetch the linked XSLT file and process the current page, loading the output in a data URI.
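As a rough sketch of that lookup (not the script's actual code): the function names and the Link shape below are made up for illustration, and fetching and parsing the profile page is left out. It only shows splitting a profile attribute into URIs and picking the 'profileTransformation' link out of a list of links, resolving its href against the profile's own address.

```typescript
// Hypothetical helpers for illustration; none of these names come from the script.
interface Link {
  rel: string;
  href: string;
}

// The profile attribute holds a whitespace-separated list of URIs.
function profileUris(profileAttr: string | null): string[] {
  if (!profileAttr) return [];
  return profileAttr.trim().split(/\s+/).filter((uri) => uri.length > 0);
}

// rel is also a whitespace-separated token list. Return the href of the first
// link carrying the profileTransformation token (exact match, an assumption),
// resolved against the URL the profile was fetched from.
function findTransformation(links: Link[], profileUrl: string): string | null {
  for (const link of links) {
    const tokens = link.rel.trim().split(/\s+/);
    if (tokens.indexOf("profileTransformation") !== -1) {
      return new URL(link.href, profileUrl).href;
    }
  }
  return null;
}
```

Resolving against the profile URL matters because a profile page may well link to its XSLT with a relative href.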

If it doesn't find a linked profile, or the profile doesn't contain a link to an XSLT file, it will go through a list of known microformats; if it finds one of these microformats in the current page, it will fetch the appropriate XSLT file, transform the document and load the output in a data URI.
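The fallback might look something like the sketch below. The table of formats and the example.org stylesheet URLs are placeholders rather than the script's real list, and the input is assumed to be the class attribute values already collected from the page. Class names are lowercased before matching, as the script does.

```typescript
// Hypothetical table: class names and stylesheet URLs are placeholders.
const KNOWN_FORMATS: Array<{ className: string; xslt: string }> = [
  { className: "vcard", xslt: "http://example.org/xslt/vcard.xsl" },
  { className: "vevent", xslt: "http://example.org/xslt/vevent.xsl" },
];

// Given the class attribute values found in a page, return the stylesheets
// for every known microformat whose class name appears among them.
function stylesheetsFor(classAttrs: string[]): string[] {
  const seen: { [name: string]: boolean } = {};
  for (const attr of classAttrs) {
    // Lowercase first, so that "VCard" and "vcard" are treated alike.
    for (const token of attr.toLowerCase().split(/\s+/)) {
      if (token) seen[token] = true;
    }
  }
  return KNOWN_FORMATS
    .filter((format) => seen[format.className] === true)
    .map((format) => format.xslt);
}
```

Keeping the table as plain data means a new format is one extra entry, with no change to the matching code.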

The end product will depend on the XSLT file, but in most cases so far will be RDF. There's a problem with the vcard and vevent formats, as they should translate to text (not XML) but the newlines are lost in the transition to a data URI; I haven't found a way around that yet.
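One thing that may be worth trying for the newline problem is percent-encoding the text before building the data URI, so that line breaks travel as %0D%0A rather than as raw characters. I don't know how the script currently builds its URI, so this is a guess at a workaround, not a confirmed fix. The function name and the default media type are assumptions.

```typescript
// Hypothetical helper: build a data URI whose payload is percent-encoded,
// so CR and LF survive as %0D and %0A instead of being dropped.
function toDataUri(text: string, mediaType: string = "text/plain"): string {
  return "data:" + mediaType + ";charset=utf-8," + encodeURIComponent(text);
}

// Example input: two lines separated by CRLF.
const exampleUri = toDataUri("BEGIN:VCARD\r\nEND:VCARD");
```

Whether the browser then shows the line breaks will still depend on the media type it is given.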

The benefits of this approach are that a) the script doesn't have to scan every page for a big list of classNames and b) the XSLT files can be updated independently of the Greasemonkey script (though this brings accompanying issues of security and reliability).

One problem at the moment is that a lot of the XSLT files aren't robust enough to cope with the variety of user-generated microformat data that exists. The Greasemonkey script already translates class names to lowercase, but there are many other inconsistencies that make the documents hard to process.