There are now at least 5 different groups of people working on scrapers that collect metadata for a given URL:
- Zotero (JavaScript, client-side)
- CiteULike (Any language, server-side)
- Connotea (Perl, server-side)
- BibDesk (Objective-C, client-side)
- Bibsonomy (?, server-side)
I'll accept that there's no chance of having them all share the same scraper code, even with something like Rhino able to run JavaScript on a server, as BibDesk wouldn't be able to use that. There must be a way, though, to describe in XML the steps needed to fetch metadata for a URL.
I imagine it would need a regular expression for matching against the URL, and an XPath or regular expression for matching against the page at that URL. The values captured by those matches would then be used to build a new URL that points at the metadata, and you might need another XPath or regular expression to extract the metadata from that response.
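To make that idea concrete, here is a rough sketch of what such a description and a tiny interpreter for it might look like. Everything in it is an assumption made up for illustration: the element names (`url-match`, `page-match`, `metadata-url`, `field`), the `journal.example.org` site, and the `fetch` callable that stands in for a real HTTP client. It uses regular expressions at every step just to stay within the Python standard library; an XPath could replace any of them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical description format: these element names are invented for this
# sketch and are not taken from Zotero, CiteULike, Connotea, BibDesk or Bibsonomy.
RULE_XML = r"""
<scraper name="example-journal">
  <url-match>^http://journal\.example\.org/article/(\d+)$</url-match>
  <page-match>href="(/export/\d+\.bib)"</page-match>
  <metadata-url>http://journal.example.org{0}</metadata-url>
  <field name="title">title\s*=\s*\{([^}]*)\}</field>
  <field name="year">year\s*=\s*\{(\d{4})\}</field>
</scraper>
"""


def load_rule(xml_text):
    """Parse the XML description into compiled patterns."""
    root = ET.fromstring(xml_text.strip())
    return {
        "url_match": re.compile(root.findtext("url-match")),
        "page_match": re.compile(root.findtext("page-match")),
        "metadata_url": root.findtext("metadata-url"),
        "fields": {f.get("name"): re.compile(f.text) for f in root.findall("field")},
    }


def scrape(rule, url, fetch):
    """Apply one rule to a URL; `fetch` maps a URL to the text found there."""
    # Step 1: does this rule apply to the URL at all?
    if rule["url_match"].search(url) is None:
        return None
    # Step 2: match against the page to capture what the metadata URL needs.
    found = rule["page_match"].search(fetch(url))
    if found is None:
        return None
    # Step 3: build and fetch the URL that holds the metadata.
    metadata_text = fetch(rule["metadata_url"].format(*found.groups()))
    # Step 4: pull each field out of the metadata document.
    item = {}
    for name, pattern in rule["fields"].items():
        hit = pattern.search(metadata_text)
        if hit:
            item[name] = hit.group(1)
    return item


# A fake "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://journal.example.org/article/42": '<a href="/export/42.bib">BibTeX</a>',
    "http://journal.example.org/export/42.bib": "@article{x, title = {On Scraping}, year = {2007}}",
}

rule = load_rule(RULE_XML)
print(scrape(rule, "http://journal.example.org/article/42", FAKE_WEB.__getitem__))
```

Passing `fetch` in as a parameter keeps the rule interpreter independent of how pages are retrieved, so a server-side tool could pass in an HTTP client and a client-side tool could pass in the page it already has loaded.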
Has anyone else tried to generalise scraping processes?
Comments

I've not tried generalizing scrapers, no. But if there were some sort of library or specification, I'd certainly try to take advantage of it! :) It would be useful for some software that I've written.
It's not a bad idea. A regex would do for the URL, but XPath is much easier to use to get at content.
One thing to note is that BibDesk currently just looks for ways to find BibTeX - it's not really scraping individual fields. The sites it supports now all provide a link somewhere that points to a pre-prepared BibTeX entry, and BibDesk just looks for that link (using XPath).
That is, except for the hCite parser, but it's basically irrelevant.
I think it might be possible to just give an XPath for each metadata field. The XPath expressions could get pretty hairy, though...
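As a rough illustration of the one-XPath-per-field idea, here is a minimal sketch. The field table, the sample page and the choice of Dublin Core style `meta` names are all assumptions for the example. It also leans on Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup, so real-world HTML would call for an HTML-aware parser with fuller XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical field table: field name -> (XPath, attribute to read).
# An attribute of None means "use the element's text instead".
FIELD_XPATHS = {
    "title": (".//meta[@name='DC.title']", "content"),
    "author": (".//meta[@name='DC.creator']", "content"),
    "journal": (".//span[@class='journal']", None),
}


def extract_fields(xhtml, field_xpaths):
    """Return a dict of the fields whose XPath matched something on the page."""
    root = ET.fromstring(xhtml)
    item = {}
    for field, (path, attribute) in field_xpaths.items():
        node = root.find(path)
        if node is None:
            continue  # this page does not provide the field
        value = node.get(attribute) if attribute else node.text
        if value:
            item[field] = value.strip()
    return item


# A made-up, well-formed page used only to exercise the sketch.
PAGE = """<html><head>
<meta name="DC.title" content="On Scraping"/>
<meta name="DC.creator" content="A. Author"/>
</head><body><span class="journal"> Journal of Examples </span></body></html>"""

print(extract_fields(PAGE, FIELD_XPATHS))
```

Fields that are missing from a page are simply left out of the result, which keeps each expression independent of the others.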
I forgot to add - if a general JavaScript library were made that just did scraping, and didn't have lots of dependencies (e.g., on the Firefox database), we *could* use it in BibDesk. It isn't hard to run arbitrary scripts on pages in WebKit. If scrapers returned a standard JSON description of an item, we could parse that with little problem.
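To show how little work the consuming side would need, here is a sketch that turns a JSON item into a BibTeX entry. No such standard JSON format exists in this discussion, so the keys used here (`type`, `key`, and one key per field, with lists for repeated values such as authors) are purely hypothetical, and the sketch does no escaping of special BibTeX characters.

```python
import json


def item_to_bibtex(json_text):
    """Convert a JSON item description (hypothetical layout) to a BibTeX entry."""
    item = json.loads(json_text)
    entry_type = item.get("type", "misc")
    key = item.get("key", "item")
    # Everything other than the type and cite key is treated as a BibTeX field.
    fields = {k: v for k, v in item.items() if k not in ("type", "key")}
    lines = [f"@{entry_type}{{{key},"]
    for name in sorted(fields):
        value = fields[name]
        if isinstance(value, list):
            value = " and ".join(value)  # BibTeX joins multiple names with "and"
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Made-up scraper output used only to exercise the sketch.
SCRAPED = json.dumps({
    "type": "article",
    "key": "doe2007",
    "title": "On Scraping",
    "author": ["J. Doe", "R. Roe"],
    "year": "2007",
})

print(item_to_bibtex(SCRAPED))
```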
What would be hard is for BibDesk to use the scrapers taken directly from Zotero, which use a bunch of utility classes from the rest of Zotero and depend on Firefox to store data.
There's also the bpr3.org team that's coming out with a science post aggregation system for posts containing literature references. They're just parsing out COinS, but I think they'd qualify as they could benefit from a generalized mechanism as well.
Thanks Michael - so maybe doing all the scraping in JavaScript might be feasible after all, as long as XPath is available...