Defining scraper mappings using CSS selectors

I've been doing a lot of web scraping lately.

Generally this involves something along the lines of:

<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // @ suppresses warnings from malformed markup
$xml = simplexml_import_dom($dom);
$items = $xml->xpath('//div[@id="foo"]/div[@class="bar"]/a');

The obvious problem is that if the element has the class "bar baz", it won't get matched, because @class="bar" compares against the whole attribute value.
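Here's a minimal sketch of that mismatch, using PHP's built-in DOMDocument and DOMXPath. The inline markup and the hrefs are made up for illustration.

```php
<?php
// Inline HTML standing in for a fetched page (hypothetical markup).
$html = '<div id="foo">'
      . '<div class="bar"><a href="/one">one</a></div>'
      . '<div class="bar baz"><a href="/two">two</a></div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// @class="bar" compares the whole attribute value,
// so the element with class="bar baz" is skipped.
$links = $xpath->query('//div[@id="foo"]/div[@class="bar"]/a');

$matched = [];
foreach ($links as $link) {
    $matched[] = $link->getAttribute('href');
}

print_r($matched);
```

Only the link inside the div whose class is exactly "bar" ends up in $matched.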

For this, you need to be able to use CSS selectors, which is where Zend_Dom_Query comes in:

<?php
require 'Zend/Dom/Query.php';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // @ suppresses warnings from malformed markup
$html = new Zend_Dom_Query($dom->saveHTML());
$items = $html->query("#foo .bar");

[insert attempts to use hpricot here]

RDF-EASE thinks along these lines too, and goes further: the mapping between HTML elements and an RDF data model is defined in an external CSS-like file:

@prefix atom "http://www.w3.org/2005/Atom";

#blog .post {
  -rdf-typeof: "atom:Entry";
}
#blog .post, #blog .post * {
  -rdf-about: nearest-ancestor(".post");
}
#blog .post[id] {
  -rdf-property: "atom:id";
  -rdf-datatype: "xsd:string";
}
#blog .post .title {
  -rdf-property: "atom:title";
  -rdf-datatype: "xsd:string";
}
#blog .post .meta .published {
  -rdf-property: "atom:published";
  -rdf-datatype: "xsd:dateTime";
}

[insert endless digressions into reading arguments about RDFa and Ian Hickson's opinions on the practicality of RDF on the web here]

The trouble there is that you have to parse the CSS-like file before you can understand the description, and CSS parsing isn't built into most languages. How about describing the mapping using JSON:

{
  "class": "atom:Entry",
  "root": "#blog .post",
    
  "properties": {
    "atom:id": ["attr(id)", "string"],
    "atom:title": [".title a", "string"],
    "atom:content": [".body", "string"],
    "atom:published": [".meta .published", "dateTime"]
  }
}

This seems to work quite well, though I haven't tried it with a wide range of use cases yet. Adding a callback function for post-processing of each property might be useful.
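As a rough illustration of how a mapping like this could be applied, here's a sketch in plain PHP. It is one possible approach under stated assumptions, not a definitive implementation: cssToXPath() and scrape() are hypothetical helpers, the sample markup is made up, and the selector conversion only understands tag names, #id, .class and the descendant combinator (a real version would hand the selectors to something like Zend_Dom_Query).

```php
<?php
// Hypothetical helper: converts a tiny subset of CSS (tag, #id, .class,
// descendant combinator) into an XPath expression relative to a context node.
function cssToXPath(string $selector): string
{
    $xpath = '.';
    foreach (preg_split('/\s+/', trim($selector)) as $part) {
        if (!preg_match('/^([a-z][a-z0-9]*)?((?:[#.][\w-]+)*)$/i', $part, $m)) {
            throw new InvalidArgumentException("Unsupported selector: $part");
        }
        $step = '//' . ($m[1] !== '' ? $m[1] : '*');
        preg_match_all('/([#.])([\w-]+)/', $m[2], $tests, PREG_SET_ORDER);
        foreach ($tests as $test) {
            $name = $test[2];
            $step .= $test[1] === '#'
                ? "[@id='$name']"
                // Match the class as a whole token, so "bar" doesn't match "rhubarb".
                : "[contains(concat(' ', normalize-space(@class), ' '), ' $name ')]";
        }
        $xpath .= $step;
    }
    return $xpath;
}

// Hypothetical helper: applies the mapping to every element matching "root".
function scrape(DOMDocument $dom, array $mapping): array
{
    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query(cssToXPath($mapping['root'])) as $root) {
        $item = ['class' => $mapping['class']];
        foreach ($mapping['properties'] as $property => [$selector, $datatype]) {
            if (preg_match('/^attr\(([\w-]+)\)$/', $selector, $m)) {
                // "attr(name)" reads an attribute of the root element itself.
                $value = $root->getAttribute($m[1]);
            } else {
                // Otherwise take the text of the first matching descendant.
                $node = $xpath->query(cssToXPath($selector), $root)->item(0);
                $value = $node ? trim($node->textContent) : null;
            }
            $item[$property] = ['value' => $value, 'datatype' => $datatype];
        }
        $items[] = $item;
    }
    return $items;
}

// Made-up page markup for the example.
$html = <<<'HTML'
<div id="blog">
  <div class="post hentry" id="post-1">
    <h2 class="title"><a href="/1">First post</a></h2>
    <div class="body">Hello world</div>
    <div class="meta"><span class="published">2009-01-01T12:00:00Z</span></div>
  </div>
</div>
HTML;

$json = <<<'JSON'
{
  "class": "atom:Entry",
  "root": "#blog .post",
  "properties": {
    "atom:id": ["attr(id)", "string"],
    "atom:title": [".title a", "string"],
    "atom:content": [".body", "string"],
    "atom:published": [".meta .published", "dateTime"]
  }
}
JSON;

$dom = new DOMDocument();
$dom->loadHTML($html);
$items = scrape($dom, json_decode($json, true));

print_r($items);
```

The datatype is carried through untouched here. A per-property callback for post-processing could be applied to $value at the point where it is stored in $item.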

Update: I've put some code in GitHub to illustrate how this can work.

Even this isn't as easy as it might be, though, due simply to inadequate/unstructured markup in the source page. I'm considering two approaches to improving this, at the source: SimpleData (a work in progress), or a Word template for generating structured HTML for specific types of content. Or maybe just a better natural language parser...

Comments

Isn't there a contains($haystack, $needle) in xpath for just this kind of thing?

$items = $xml->xpath('//div[@id="foo"]/div[contains(@class, "bar")]/a');

There is, but you have to normalize the class names first (your example would match class="rhubarb"), and it makes the selectors hugely complicated.
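To make that concrete, here's a small sketch comparing the plain contains() test with the usual normalized form. The markup and the hrefs() helper are made up for illustration.

```php
<?php
// Hypothetical markup: one wanted match and one accidental one.
$html = '<div id="foo">'
      . '<div class="bar baz"><a href="/wanted">wanted</a></div>'
      . '<div class="rhubarb"><a href="/unwanted">unwanted</a></div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: collects the href of every node the expression matches.
function hrefs(DOMXPath $xpath, string $expression): array
{
    $out = [];
    foreach ($xpath->query($expression) as $node) {
        $out[] = $node->getAttribute('href');
    }
    return $out;
}

// Plain substring test: also matches class="rhubarb".
$naive = hrefs($xpath, '//div[@id="foo"]/div[contains(@class, "bar")]/a');

// Pad the normalized class list with spaces and look for " bar " as a whole token.
$strict = hrefs(
    $xpath,
    '//div[@id="foo"]/div[contains(concat(" ", normalize-space(@class), " "), " bar ")]/a'
);

print_r($naive);
print_r($strict);
```

The naive expression picks up both links, the normalized one only the first, and that is for a single class on a single step of the path.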

Check out Web::Scraper for a very nice scraping subsystem for Perl using CSS selectors; I use it a lot.

http://www.slideshare.net/miyagawa/webscraper

Zend_Dom_Query is fine for scraping, in PHP.

The point of the second part of the post is that I'd like to be able to define the semantics of the source document using an external file, preferably using CSS selectors.

Ideally, I'd like to be able to do the whole thing in JavaScript, but I haven't yet found a solid way to load an HTML document and run jQuery on the command line.
