There are now at least five different groups working on scrapers that collect metadata for a given URL, including:
- CiteULike (Any language, server-side)
- Connotea (Perl, server-side)
- BibDesk (Objective C, client-side)
- BibSonomy (?, server-side)
I imagine a generalised scraper definition would need:
- a regular expression for matching against the URL;
- an XPath expression or regular expression for matching against the page at that URL;
- a way to use those matches to construct a new URL that serves the metadata; and
- possibly another XPath expression or regular expression for extracting the metadata itself.
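As a rough illustration of what such a shared scraper definition might look like, here is a minimal Python sketch. Everything in it is an assumption for the sake of example: the profile fields (`url_pattern`, `metadata_url`, `fields`), the example-journal URLs, and the RIS-style field regexes are all hypothetical, not part of any of the systems listed above.

```python
import re

# Hypothetical "site profile": one entry per publisher, describing how to
# recognise its article URLs and where to find machine-readable metadata.
# All field names here are invented for this sketch, not an existing standard.
SITE_PROFILES = [
    {
        # 1. Regex matched against the URL to decide whether this profile applies.
        "url_pattern": re.compile(r"^https?://www\.example-journal\.org/article/(\d+)"),
        # 2. Template for the URL that serves the metadata, filled from the match.
        "metadata_url": "https://www.example-journal.org/export/{0}.ris",
        # 3. Regexes applied to the fetched metadata document to pull out fields.
        "fields": {
            "title": re.compile(r"^TI\s+-\s+(.*)$", re.MULTILINE),
            "doi": re.compile(r"^DO\s+-\s+(.*)$", re.MULTILINE),
        },
    },
]

def find_profile(url):
    """Return (profile, match) for the first profile whose pattern matches."""
    for profile in SITE_PROFILES:
        match = profile["url_pattern"].match(url)
        if match:
            return profile, match
    return None, None

def extract_metadata(profile, metadata_text):
    """Apply each field regex to the downloaded metadata document."""
    result = {}
    for name, pattern in profile["fields"].items():
        m = pattern.search(metadata_text)
        if m:
            result[name] = m.group(1).strip()
    return result

profile, match = find_profile("https://www.example-journal.org/article/1234")
metadata_url = profile["metadata_url"].format(match.group(1))
# A real scraper would now fetch metadata_url; here the response is faked.
fake_ris = "TI  - A Generalised Scraper\nDO  - 10.1000/xyz123\n"
print(extract_metadata(profile, fake_ris))
```

The point of keeping the profiles as plain data rather than code is that they could, in principle, be shared between implementations in different languages, with each client supplying its own regex and XPath engines.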
Has anyone else tried to generalise scraping processes?