YQL Open Data Tables

I nearly got as far as setting up an Open Data Table definition for WeFollow, so that it could be queried using YQL. Sadly the HTML parser that YQL exposes for arbitrary URLs isn't available to web services defined using Open Data Tables - they have to return either well-formed XML or JSON.

Still, I like the style of the Open Data Table definitions, and found some useful documentation in Chapter 5 of the Yahoo Query Language guide ("Using YQL Open Data Tables").

There are lots of community-contributed definitions on the GitHub account of Sam Pullara (of Yahoo!).

Also, it would be great if YQL could use CSS selectors, in the way that Freebase's Acre does using Sizzle and Rhino, alongside the existing XPath selectors that are available for use on parsed HTML.

Comments

Alf: would either of these satisfy my wish for generic scrapers that are much easier to write, for Zotero, etc.? It seems so, but I'm not sure.

Posted by: Bruce on March 16, 2009 4:19 AM

I think the difficult part of creating a structure for defining scrapers is making it generic enough that it's easy to understand, but adaptable enough that it can cope with any situation. Open Data Tables seem designed to cope with well-formed XML, whereas RDF-EASE and the JSON-defined scraping definitions that I'm working on will work with anything that can be parsed as HTML.

Sizzle and Firefox 3.1's document.querySelectorAll make it possible to define scraping rules using CSS selectors which, along with XPath, make selecting nodes pretty easy. I don't know if Zotero is able to require Firefox 3.1 in order to use querySelectorAll in its scrapers yet...
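As a rough illustration of selector-driven extraction, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed markup. The sample markup, the class names and the `select_text` helper are all hypothetical; nothing here is taken from WeFollow, YQL or Zotero.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <ul id="users">
      <li class="user"><a href="/alice">Alice</a></li>
      <li class="user"><a href="/bob">Bob</a></li>
      <li class="ad"><a href="/buy">Buy now</a></li>
    </ul>
  </body>
</html>
"""

def select_text(root, path):
    """Return the stripped text of every element matched by an ElementTree path."""
    return [(node.text or "").strip() for node in root.findall(path)]

root = ET.fromstring(PAGE.strip())

# Roughly the CSS selector "li.user > a", written in ElementTree's XPath subset.
# Note that @class='user' is an exact match on the whole attribute value,
# whereas a CSS class selector matches individual class tokens.
names = select_text(root, ".//li[@class='user']/a")
print(names)
```

The selector is a plain string, so it could equally live in a definition file rather than in code.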

The remaining complexity is the inevitable pre- and (especially) post-processing that data needs once nodes/text have been selected. In this respect, having Zotero's scrapers in individual JavaScript files/functions probably makes good sense.

So I'd say that what Zotero and co need is a simple, structured and expressive way to define scrapers (much like CSL does for styles), combined with straightforward selectors and an extensible mechanism for manipulating data once it's been selected.
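To make that combination concrete, here is a small Python sketch under stated assumptions: the JSON layout (`item`, `fields`, `select`, `post`), the post-processor names and the sample markup are all invented for illustration, and the selectors are limited to the XPath subset that xml.etree.ElementTree understands on well-formed input. It is not the format used by Zotero, RDF-EASE or Open Data Tables.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical scraper definition: one selector per field, plus optional
# named post-processing steps that are applied in order.
DEFINITION = json.loads("""
{
  "item": ".//div[@class='entry']",
  "fields": {
    "title": {"select": "h2", "post": ["strip"]},
    "year":  {"select": "span[@class='date']", "post": ["strip", "year"]}
  }
}
""")

# Extensible registry of post-processors; new cleanup rules are added here.
POST_PROCESSORS = {
    "strip": lambda value: value.strip(),
    "year": lambda value: int(value[:4]),  # assumes a leading four-digit year
}

def scrape(markup, definition):
    """Apply a definition to well-formed markup and return one dict per item."""
    root = ET.fromstring(markup.strip())
    results = []
    for item in root.findall(definition["item"]):
        record = {}
        for name, rule in definition["fields"].items():
            node = item.find(rule["select"])
            if node is None:
                continue  # field is missing from this item; skip it
            value = node.text or ""
            for step in rule.get("post", []):
                value = POST_PROCESSORS[step](value)
            record[name] = value
        results.append(record)
    return results

# Hypothetical sample input.
SAMPLE = """
<body>
  <div class="entry"><h2> First post </h2><span class="date"> 2009-03-16 </span></div>
  <div class="entry"><h2>Second post</h2></div>
</body>
"""

records = scrape(SAMPLE, DEFINITION)
print(records)
```

The definition stays declarative and easy to read, while anything awkward is pushed into the registry of named functions, which is roughly the role that per-site JavaScript plays in Zotero's scrapers.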
