Defining scraper mappings using CSS selectors


I've been doing a lot of web scraping lately.

Generally this involves something along the lines of:

$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // suppress warnings about malformed markup
$xml = simplexml_import_dom($dom);
$items = $xml->xpath('//div[@id="foo"]/div[@class="bar"]/a');

The obvious problem is that @class="bar" compares the whole attribute value, so if the element has the class "bar baz", it won't get matched.
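
If you're stuck with XPath, the usual workaround is to match the class as a whitespace-delimited token instead of comparing the whole attribute. A minimal sketch, assuming the same #foo / .bar structure as above; the sample markup and the classToken() helper are made up for illustration:

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one class token,
// so class="bar baz" still counts as having the class "bar".
function classToken(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Sample markup, invented for this example.
$html = '<div id="foo">'
      . '<div class="bar"><a href="/one">One</a></div>'
      . '<div class="bar baz"><a href="/two">Two</a></div>'
      . '<div class="foobar"><a href="/three">Three</a></div>'
      . '</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings about malformed markup

$xpath = new DOMXPath($dom);

// Exact comparison: only finds the div whose class is exactly "bar".
$exact = $xpath->query('//div[@id="foo"]/div[@class="bar"]/a');

// Token comparison: finds "bar" and "bar baz", but not "foobar".
$links = $xpath->query('//div[@id="foo"]/div[' . classToken('bar') . ']/a');

$hrefs = [];
foreach ($links as $link) {
    $hrefs[] = $link->getAttribute('href');
}
```

It works, but nobody wants to type that predicate more than once, which is the appeal of CSS selectors.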

For this, you need to be able to use CSS selectors, which is where Zend_Dom_Query comes in:

require 'Zend/Dom/Query.php';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // suppress warnings about malformed markup
$html = new Zend_Dom_Query($dom->saveHTML());
$items = $html->query("#foo .bar");

[insert attempts to use hpricot here]

RDF-EASE is also thinking along these lines, and going further: in this case, the mapping between HTML elements and an RDF data model is defined using an external CSS-like file:

@prefix atom "";
#blog .post {
  -rdf-typeof: "atom:Entry";
}
#blog .post, #blog .post * {
  -rdf-about: nearest-ancestor(".post");
}
#blog .post[id] {
  -rdf-property: "atom:id";
  -rdf-datatype: "xsd:string";
}
#blog .post .title {
  -rdf-property: "atom:title";
  -rdf-datatype: "xsd:string";
}
#blog .post .meta .published {
  -rdf-property: "atom:published";
  -rdf-datatype: "xsd:dateTime";
}

[insert endless digressions into reading arguments about RDFa and Ian Hickson's opinions on the practicality of RDF on the web here]

The trouble there is that you have to parse the CSS-like file before you can use the mapping, and most languages have no native parser for that syntax. How about describing the mapping using JSON, which nearly every language can read out of the box:

{
  "class": "atom:Entry",
  "root": "#blog .post",
  "properties": {
    "atom:id": ["attr(id)", "string"],
    "atom:title": [".title a", "string"],
    "atom:content": [".body", "string"],
    "atom:published": [".meta .published", "dateTime"]
  }
}

This seems to work quite well, though I haven't tried it with a wide range of use cases yet. Adding a callback function for post-processing of each property might be useful.
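
A rough sketch of what applying such a mapping might look like in PHP, including the optional per-property callback. Everything here is an assumption made for illustration: the function names cssToXPath() and extractItems(), the sample markup, and the toy selector converter, which only understands tags, #id, .class and the descendant combinator (a real scraper would hand selectors to a proper library such as Zend_Dom_Query). The datatype in each property is carried in the mapping but not interpreted.

```php
<?php
// Hypothetical helper: converts a tiny subset of CSS (tag, #id, .class,
// descendant combinator) into an XPath expression starting with "//".
function cssToXPath(string $selector): string
{
    $xpath = '';
    foreach (preg_split('/\s+/', trim($selector)) as $step) {
        // Split a compound selector such as "div.post" into its parts.
        preg_match_all('/([#.]?)([\w-]+)/', $step, $parts, PREG_SET_ORDER);
        $tag = '*';
        $tests = '';
        foreach ($parts as $part) {
            if ($part[1] === '#') {
                $tests .= "[@id='{$part[2]}']";
            } elseif ($part[1] === '.') {
                $tests .= "[contains(concat(' ', normalize-space(@class), ' '), ' {$part[2]} ')]";
            } else {
                $tag = $part[2];
            }
        }
        $xpath .= '//' . $tag . $tests;
    }
    return $xpath;
}

// Hypothetical helper: returns one array per element matching the mapping's root.
// $callbacks maps property names to functions used for post-processing.
function extractItems(DOMDocument $dom, array $mapping, array $callbacks = []): array
{
    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query(cssToXPath($mapping['root'])) as $root) {
        $item = ['class' => $mapping['class']];
        foreach ($mapping['properties'] as $property => $spec) {
            $selector = $spec[0]; // $spec[1] is the datatype, ignored in this sketch
            if (preg_match('/^attr\(([\w-]+)\)$/', $selector, $m)) {
                // "attr(name)" reads an attribute of the root element itself.
                $value = $root->getAttribute($m[1]);
            } else {
                // Anything else is a selector relative to the root element.
                $node = $xpath->query('.' . cssToXPath($selector), $root)->item(0);
                $value = $node ? trim($node->textContent) : null;
            }
            if ($value !== null && isset($callbacks[$property])) {
                $value = call_user_func($callbacks[$property], $value);
            }
            $item[$property] = $value;
        }
        $items[] = $item;
    }
    return $items;
}

$json = <<<'JSON'
{
  "class": "atom:Entry",
  "root": "#blog .post",
  "properties": {
    "atom:id": ["attr(id)", "string"],
    "atom:title": [".title a", "string"],
    "atom:content": [".body", "string"],
    "atom:published": [".meta .published", "dateTime"]
  }
}
JSON;

// Sample markup, invented for this example.
$html = '<div id="blog"><div class="post" id="post-1">'
      . '<h2 class="title"><a href="/post-1"> First post </a></h2>'
      . '<div class="body">Hello world</div>'
      . '<div class="meta"><span class="published">2009-01-02T03:04:05Z</span></div>'
      . '</div></div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings about malformed markup

// Example callback: keep only the date part of the published timestamp.
$callbacks = ['atom:published' => function ($value) { return substr($value, 0, 10); }];
$items = extractItems($dom, json_decode($json, true), $callbacks);
```

Keeping the callbacks in a separate array, keyed by property name, means the JSON mapping itself stays plain data that any language can read.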

Update: I've put some code on GitHub to illustrate how this can work.

Even this isn't as easy as it might be, though, simply because the markup in the source page is often inadequate or unstructured. I'm considering two approaches to improving this at the source: SimpleData (a work in progress), or a Word template for generating structured HTML for specific types of content. Or maybe just a better natural language parser...