Visual Scrapers

A couple of visual scraper utilities showed up recently: the web-based Dapper and desktop-based OpenKapow.

OpenKapow is a >100MB download, ~80MB of which is the standard Java 1.5 runtime environment, which seems a bit ridiculous. It's an impressive piece of software once you get it running, with just about every option available. I couldn't work out how to get it to produce any output though.

Dapper is visually slick and takes you through the process smoothly (with a few unnecessary sidesteps), but it's too limited to be useful: there's no tree view of the DOM, so it's impossible to select the elements in the page that actually matter.

So it's back to the usual PHP chain of file_get_contents > Tidy > SimpleXML > XPath, plus regular expressions, which works well enough.
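For anyone who hasn't used that chain, here is a minimal sketch of how it might look. It assumes PHP with the tidy and SimpleXML extensions installed; the function name scrape_html, the Tidy options, and the whitespace-collapsing regular expression are illustrative choices of mine, not a fixed recipe.

```php
<?php
// Hypothetical helper: returns the cleaned text of every node matching $xpath.
function scrape_html(string $html, string $xpath): array
{
    // Tidy repairs the tag soup and emits well-formed XHTML.
    $xhtml = tidy_repair_string($html, [
        'output-xhtml'     => true,
        'numeric-entities' => true,   // named entities like &nbsp; break XML parsers
        'doctype'          => 'omit',
    ], 'utf8');
    if ($xhtml === false) {
        return [];
    }

    // Drop the default XHTML namespace so plain XPath names like //li/a match.
    $xhtml = str_replace(' xmlns="http://www.w3.org/1999/xhtml"', '', $xhtml);

    $xml = simplexml_load_string($xhtml);
    if ($xml === false) {
        return [];
    }

    $results = [];
    foreach ($xml->xpath($xpath) ?: [] as $node) {
        // The string cast yields only the element's own text, not its children's.
        // Regular expression cleanup: collapse runs of whitespace.
        $results[] = trim(preg_replace('/\s+/', ' ', (string) $node));
    }
    return $results;
}
```

Fetching is then a single call, something like scrape_html(file_get_contents($url), '//li/a'), where $url is whatever page you're scraping and allow_url_fopen is enabled. Keeping the fetch outside the function makes it easy to try the XPath against a saved copy of the page first.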

Comments

Agreed about Dapper and OpenKapow. I've never even bothered dl'ing OpenKapow, it's just far too daunting!

I tend to use feed43.com -- nice and simple regexp-ish patterns, with a really great AJAX UI to instantly preview what you're doing. It's the best UI for quick-scrape hacks I've found so far, and I've been messing with this stuff since last century when I wrote Sitescooper.

Thanks Justin - with feed43 I easily made the feed that I'd failed to make with Dapper and OpenKapow.
