There are now at least five different groups working on scrapers that collect bibliographic metadata for a given URL:
- Zotero (JavaScript, client-side)
- CiteULike (any language, server-side)
- Connotea (Perl, server-side)
- BibDesk (Objective-C, client-side)
- BibSonomy (?, server-side)
I'll accept that there's no chance of having them all share the same scraper code: even with something like Rhino able to run JavaScript on a server, BibDesk wouldn't be able to use it. There must be a way, though, to describe in XML the methods needed to fetch metadata for a URL.
I imagine such a description would need a regular expression for matching against the URL, and an XPath expression or regular expression for matching against the page at that URL. The resulting matches would then be used to construct a new URL containing the metadata, and another XPath or regular expression would extract the metadata itself.
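As a rough sketch, here's what such a description might look like. The element and attribute names are invented, and the PubMed URL pattern and XPaths are purely illustrative:

```xml
<!-- Hypothetical scraper description; all names here are invented for illustration. -->
<scraper name="PubMed">
  <!-- Regex matched against the page URL; capture groups can be reused below. -->
  <urlPattern>^https?://www\.ncbi\.nlm\.nih\.gov/pubmed/(\d+)</urlPattern>
  <!-- Template for the URL of the document that actually contains the metadata,
       filled in from the capture groups above. -->
  <metadataUrl>https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&amp;id=$1&amp;retmode=xml</metadataUrl>
  <!-- One XPath per metadata field, evaluated against the fetched document. -->
  <field name="title" xpath="//ArticleTitle"/>
  <field name="journal" xpath="//Journal/Title"/>
  <field name="year" xpath="//PubDate/Year"/>
</scraper>
```

Each project, whatever its language, could then apply these descriptions with a small generic engine. A minimal sketch in Python, assuming the description above has already been parsed into a plain dictionary:

```python
import re
import urllib.request
from xml.etree import ElementTree

# A scraper description, as it might be parsed out of the XML sketch above.
# The structure and the PubMed example are invented for illustration.
SCRAPER = {
    "url_pattern": r"^https?://www\.ncbi\.nlm\.nih\.gov/pubmed/(\d+)",
    "metadata_url": "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                    "?db=pubmed&id={0}&retmode=xml",
    "fields": {
        "title": ".//ArticleTitle",
        "journal": ".//Journal/Title",
        "year": ".//PubDate/Year",
    },
}

def scrape(url, scraper=SCRAPER):
    """Return a dict of metadata for `url`, or None if the scraper doesn't apply."""
    match = re.match(scraper["url_pattern"], url)
    if not match:
        return None
    # Build the metadata URL from the capture groups and fetch the document.
    metadata_url = scraper["metadata_url"].format(*match.groups())
    with urllib.request.urlopen(metadata_url) as response:
        document = ElementTree.parse(response)
    # Evaluate one XPath per field against the fetched document.
    return {name: document.findtext(xpath)
            for name, xpath in scraper["fields"].items()}

if __name__ == "__main__":
    print(scrape("https://www.ncbi.nlm.nih.gov/pubmed/16403221"))
```

The point is that only the small engine would be implementation-specific; the descriptions themselves could be shared between all five projects.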
Has anyone else tried to generalise scraping processes?