Server-side DOM scraping with JavaScript: options

Update: Yahoo! announced YQL Execute yesterday. It allows server-side JavaScript (including CSS 3 selectors) to run after the DOM has been fetched and before the YQL results are returned. The one problem with YQL is that, because it obeys robots.txt rules, it is often denied access to web content.
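To illustrate why a crawler that honours robots.txt gets turned away so often, here is a minimal sketch of such a check in plain JavaScript. It is not YQL's implementation: the function name isAllowed, the sample rules and the "ExampleBot" user agent are all hypothetical, and the matching is deliberately simplified (prefix-only Disallow rules, no Allow lines or wildcards, and matching groups are merged rather than ranked by specificity).

```javascript
// Simplified robots.txt check (illustrative assumption, not YQL's code).
// Handles only "User-agent" and "Disallow" lines with plain path prefixes.
function isAllowed(robotsTxt, userAgent, path) {
  const agent = userAgent.toLowerCase();
  const disallowed = [];
  let applies = false;     // does the current rule group apply to this agent?
  let inAgentRun = false;  // are we inside consecutive User-agent lines?

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      const match =
        value === "*" || (value !== "" && agent.includes(value.toLowerCase()));
      // Consecutive User-agent lines share one group of rules.
      applies = inAgentRun ? applies || match : match;
      inAgentRun = true;
    } else {
      inAgentRun = false;
      if (field === "disallow" && applies && value !== "") {
        disallowed.push(value);
      }
    }
  }
  // The path is allowed unless it starts with a collected Disallow prefix.
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

// Hypothetical rules: everyone is kept out of /private/,
// and "ExampleBot" is kept out of the whole site.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");
```

With these sample rules, a scraper identifying itself as "ExampleBot" is refused every path on the site, which is the kind of blanket denial a well-behaved service runs into.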

Comments

You missed out YQL's new Execute method, which is kind of like stored procedures for DOM scraping written in server-side JavaScript (announced today):

http://developer.yahoo.net/blog/archives/2009/04/yql_execute.html

It may be a shameless plug, but you should really check out ESXX at http://esxx.org/. Scraping an HTML page is a one-liner:

var doc = new URI("http://esxx.org/").load();

The 'doc' variable then contains an E4X node that can be accessed directly. For instance, the expression

doc.body..p[0]

returns the first paragraph, while

for each (let a in doc..a.(/^http:/.test(@href))) {
    // use 'a'
}

iterates over all 'a' elements whose 'href' attribute begins with 'http:'.
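For readers without an E4X-capable engine, the same predicate can be tried in plain JavaScript. This is only a sketch: the anchors array is a hypothetical stand-in for the 'a' elements of a scraped page, and httpLinks is an illustrative name, not part of ESXX.

```javascript
// Hypothetical stand-in for the <a> elements found in a scraped document.
const anchors = [
  { href: "http://esxx.org/", text: "ESXX" },
  { href: "https://example.com/", text: "secure link" },
  { href: "/docs", text: "relative link" },
  { text: "anchor without href" },
];

// Keep only anchors whose href begins with "http:", using the same
// regular expression as the E4X filter.
function httpLinks(list) {
  return list.filter(
    (a) => typeof a.href === "string" && /^http:/.test(a.href)
  );
}

const result = httpLinks(anchors);
for (const a of result) {
  console.log(a.href); // use 'a'
}
```

Note that the pattern /^http:/ requires the colon straight after "http", so "https:" links and relative links are skipped.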
