About reviews and microformats

Once upon a time, Seb Paquet was talking about structured blogging, and how it would be useful if reviews that people posted on their own sites were marked up so that they could be aggregated and analysed.

I made a tool for people to use to post reviews to their sites, fetching data from Amazon and formatting a post with metadata embedded as RDF. The idea was that weblog harvesting sites like AllConsuming could pick up this metadata - the most important being the rating - and aggregate it in a central location. Unfortunately, Movable Type with 'Convert Line Breaks' enabled would put a <br> after every line, thus breaking the RDF, which basically made this approach impractical.

I also made a prototype weblog system that could store different types of posts, based on Blosxom but using MySQL for storage, with a different input form for each type of post (reviews, photos, etc.). The idea then was that weblogs could publish this data as part of their RSS feeds using a standard format (such as RVW), so that the harvesting tools could collect the information this way.

The main problem with this approach was that the weblog storage systems (using MySQL or text files or whatever) weren't designed to be extensible with extra fields for identifiers, ratings, etc. There were hacks to put name:value pairs into the keywords field, but that wasn't ideal. The APIs were also unable to handle this data, so desktop tools weren't able to format and post extended fields either.

This seems to be the problem that hReview (and most of Technorati's microformats, as well as structuredblogging.org's plugins) is designed to address: all of the semantic markup is wrapped around the data in the (X)HTML itself, so any old tool can format and post reviews without needing any special storage capabilities.

What I don't like about this is the possibility that you might want to add or change certain fields in the future, e.g. to go back and add 'record label' metadata to your album reviews, or 'price' metadata to your hotel reviews. Without proper database storage for each field, you would have to reformat every semantically marked-up post to include the new fields, which could be difficult.

If weblogs in the future are all built on XML databases - which I guess could mean storage as XHTML - then posts will be searchable and editable with XPath/XQuery while being published as Atom and (X)HTML (and perhaps separately as RDF, if necessary). I'd like to know, therefore, whether embedding the metadata in the text, rather than storing it in separate fields, is a feasible and future-proof way to work.
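
To make the question a bit more concrete, here is a rough sketch of what querying and later extending a stored post might look like. It assumes the post is kept as a well-formed XHTML fragment with no namespace, uses class names loosely modelled on the hReview draft ('hreview', 'item', 'fn', 'rating', 'description'), and invents a 'label' class for the record label, which is purely hypothetical and not part of any published format. It uses the limited XPath subset in Python's standard library rather than a real XML database.

```python
import xml.etree.ElementTree as ET

# A stored review post. The markup is illustrative only: the class names are
# modelled on the hReview draft, and the content is made up for this sketch.
POST = """<div class="hreview">
  <span class="item"><span class="fn">Example Album</span></span>
  <span class="rating">4</span>
  <div class="description">A short review body.</div>
</div>"""


def field(root, name):
    """Return the text of the first descendant whose class is exactly `name`.

    This is an exact attribute match, so an element with several classes
    (class="rating featured") would not be found by this simple query.
    """
    node = root.find(f".//*[@class='{name}']")
    return node.text if node is not None else None


def add_field(root, name, value):
    """Append a new <span class="name">value</span> to an existing post."""
    span = ET.SubElement(root, "span", {"class": name})
    span.text = value
    return span


root = ET.fromstring(POST)
title = field(root, "fn")
rating = field(root, "rating")

# Going back later to add a field that was not in the original post.
# 'label' is a hypothetical class name chosen for this example.
add_field(root, "label", "Example Records")
label = field(root, "label")

print(title, rating, label)
```

Adding the field to one post is easy enough; the open question is how well this scales when every existing post needs the same change and the surrounding markup varies from post to post.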

I guess the question is: is XHTML a good enough storage system?


One other thing about hReview and other proposed formats: the rating is the most valuable item, and they seem to have taken the route I took for the first version of RVW - using a default rating scale of 1-5, plus optional minimum and maximum values for the rating. However, if all the rating data has to be normalised at some point, it makes sense to get everyone to publish their data in the same way, rather than every aggregator having to work out what each rating actually means. This is why a mandatory percentage scale makes more sense: people can still enter their ratings on whatever scale they like, and the rating is simply converted, stored and published as a percentage (see the rvw! tool for an example of this). Metacritic, Pitchforkmedia, Jason Kottke, IMDB, iTunes and my own reviews posted to del.icio.us (for lack of a more specialised storage system) all publish their ratings on a 'two significant digits' scale.
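
As an illustration of the conversion step, here is a minimal sketch of one possible mapping from an arbitrary scale to a percentage. The function name and the linear worst-to-best mapping are assumptions made for this example, not necessarily what the rvw! tool does; a different convention (rating divided by maximum, say) would give different numbers, which is exactly the ambiguity a single published scale avoids.

```python
def to_percent(rating, worst=1.0, best=5.0):
    """Map a rating on the scale [worst, best] to a whole-number percentage.

    Assumes a linear scale where `worst` maps to 0 and `best` maps to 100.
    The 1-5 default mirrors the default scale mentioned in the text.
    """
    if best <= worst:
        raise ValueError("best must be greater than worst")
    if not worst <= rating <= best:
        raise ValueError("rating is outside the declared scale")
    return round(100 * (rating - worst) / (best - worst))


# Ratings entered on different scales end up on the same published scale.
four_of_five = to_percent(4)              # default 1-5 scale
seven_half_of_ten = to_percent(7.5, 0, 10)  # a 0-10 scale

print(four_of_five, seven_half_of_ten)
```

Under this mapping, 4 on a 1-5 scale and 7.5 on a 0-10 scale both come out as 75, so an aggregator can compare them without knowing which scale the reviewer typed the rating in.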


Sidenote: I could have linked to a million other related things here, but tried to keep the focus as narrow as possible. Here are a few recent items from people who have been interested in this for a long time:
Les Orchard
Danny Ayers
Phil Wilson
Ken MacLeod
Phillip Pearson