ecsstract: Scraping in XULRunner with JSON/CSS selectors

I've created a project on GitHub that contains a) an XULRunner application for scraping web pages (basically Crowbar* with all the RDF stuff taken out) and b) a sample PHP script and JSON definitions file for scraping an example web page.

Running the application requires Firefox 3.1b (Windows, Mac OS X) or XULRunner 1.9.1 (Linux), as it makes use of querySelector and querySelectorAll.

The idea is that the rules for scraping a page are written as a JSON definition, like this:

{
  "name": "Borderline",
  "enabled": 1,
  
  "url": "http://www.mamagroup.co.uk/borderline/index.html",  
  "root": "#table_listings tr",
  
  "properties": {
    "dc:identifier": [".buytixlink a", ["attribute", "href"]],
    "dc:title": [".lst_head", "text"],
    "dc:description": ["", "html"],
    "event:price": [".lst_price", "text"],
    "dc:date": [".lst_date", "text", "var d = Date.parseDate(text, 'H:i (l j F)'); if (d) return d.getTime();"]
  }
}

The "root" is the selector for the repeating element, and the "properties" are pairs of property names and selectors to be applied to each "root" element. Each property definition can also carry two extra items: the "type" of the property ('text', 'html', 'innerhtml', 'attribute' or 'match', which is like 'text' but with a regular expression) and a JavaScript function for manipulating the extracted text. I suspect these parameters will need to be expanded to cover more real-world examples, but they seem to work well so far.
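
To make the shape of a property definition more concrete, here is a minimal sketch of how one could be applied to a single "root" element. This is not the actual ecsstract code: the name extractProperty is hypothetical, an empty selector is assumed to mean "the root element itself", 'innerhtml' is assumed to map to innerHTML, and the function body is assumed to receive the extracted value as a variable called text (as the "dc:date" example suggests). The 'html' and 'match' types are left out, since their exact behaviour isn't spelled out here.

```javascript
// Hypothetical helper: apply one property definition to one "root" element.
// A definition is [selector, type, optionalFunctionBody].
function extractProperty(root, definition) {
  var selector = definition[0];
  var type = definition[1];
  var body = definition[2];

  // Assumption: an empty selector means "use the root element itself".
  var node = selector ? root.querySelector(selector) : root;
  if (!node) return null;

  // The type is either a plain string ("text") or an array (["attribute", "href"]).
  var kind = typeof type === "string" ? type : type[0];
  var arg = typeof type === "string" ? undefined : type[1];

  var value;
  switch (kind) {
    case "text":
      value = node.textContent;
      break;
    case "innerhtml":
      // Assumption: 'innerhtml' maps directly to the DOM innerHTML property.
      value = node.innerHTML;
      break;
    case "attribute":
      value = node.getAttribute(arg);
      break;
    default:
      throw new Error("Type not covered by this sketch: " + kind);
  }

  // Optional third item: a function body that sees the value as `text`.
  if (body && value !== null) {
    return new Function("text", body)(value);
  }
  return value;
}
```

Building the function with new Function keeps the definitions file pure JSON, at the cost of running whatever code the definitions contain, so the definitions need to come from a trusted source.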

Those definitions are then passed to a scraper function, which is injected in a sandbox into the web page being scraped and returns an array of extracted elements and their properties.
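
The scraper function itself could look roughly like the sketch below. Again, this is an illustration rather than the code in the repository: the names scrape and extract are hypothetical, extract is assumed to be a callback that turns one property definition into a value for one element, and the sandbox injection step is left out entirely.

```javascript
// Hypothetical scraper: build one object per element matched by "root".
// `extract(rootElement, propertyDefinition)` is an assumed callback that
// returns the value for a single property.
function scrape(doc, definition, extract) {
  var items = [];
  var roots = doc.querySelectorAll(definition.root);

  for (var i = 0; i < roots.length; i++) {
    var item = {};
    for (var name in definition.properties) {
      // Skip anything inherited from the prototype chain.
      if (Object.prototype.hasOwnProperty.call(definition.properties, name)) {
        item[name] = extract(roots[i], definition.properties[name]);
      }
    }
    items.push(item);
  }

  return items;
}
```

With the Borderline definition above, each table row matched by "#table_listings tr" would produce one object keyed by "dc:identifier", "dc:title" and so on.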

I would have liked to be able to inject jQuery too, but it looks like XULRunner doesn't provide the right environment. If necessary, ecsstract could run as a Firefox extension, perhaps.

The next step is to make sure it runs headless (no alert boxes, for example) and test it under Xvfb.

*Thanks to all of joshua's Delicious followers for reminding me about Crowbar.