Server-side scraping with JavaScript


Update 2007-12-03: Runs in Zotero now as well. Moved source to Google Code repository.

Update 2007-11-14: Rewritten to remove the E4X dependency, so it's more likely to run in WebKit, and to bring the functions more in line with Zotero's.

The code, available through svn, contains:

  • rhino.jar: Rhino 1.6R7, a Java-based implementation of JavaScript that can be run from the command line.
  • env.js and jquery.js from John Resig, which make jQuery functions available in Rhino.
  • test.js, which maps URLs to translators and contains the main function.
  • utilities.js, which mimics the Zotero object and contains some utility functions.
  • amazon.js, which, given an Amazon URL, scrapes an ASIN and looks up the metadata using Amazon's ECS web service, returning a metadata object. This is based on the Amazon translator in Zotero, but modified to use jQuery functions.
  • tidy-proxy.php, which fetches a URL and runs it through Tidy. You'll need to place it so it's accessible at and have Tidy and the Tidy extension for PHP 5 installed.
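
To give a rough idea of what "maps URLs to translators" can look like, here is a minimal sketch. It is not the actual contents of test.js: the `translators` array, the `findTranslator` function and the Amazon URL pattern are all names and values made up for illustration. It sticks to plain `var` and `for` loops so it should be acceptable to an older engine like Rhino as well as to a browser.

```javascript
// Hypothetical registry (not taken from test.js): each entry pairs a
// URL pattern with the name of the translator that handles it.
var translators = [
  { name: "amazon", pattern: /^https?:\/\/(?:www\.)?amazon\.[a-z.]+\// }
];

// Return the name of the first translator whose pattern matches the URL,
// or null when no translator claims it.
function findTranslator(url) {
  for (var i = 0; i < translators.length; i++) {
    if (translators[i].pattern.test(url)) {
      return translators[i].name;
    }
  }
  return null;
}
```

Supporting another site would then mean adding one more entry to the array, with the lookup itself left unchanged.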

What this does:

When you run

Rhino should load the test.js file, which will pull in the other .js files. It'll then fetch an item page from Amazon, convert it to XHTML using the Tidy proxy, load it and call two functions loosely based on Zotero translators. The first function will detect the type of item ("Book" in this case). The second function will detect the ASIN, look up the metadata from ECS, parse the XML and produce a metadata object that can then be passed to a bibliographic manager.
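
The second step can be sketched roughly as follows. This is an illustration under assumptions, not the code in amazon.js: `extractAsin` and `scrape` are hypothetical names, the URL shapes matched by the regular expression are only the common ones I'm assuming here, and the ECS request is replaced by a `lookup` callback so the example runs without network access.

```javascript
// Hypothetical helper: pull a 10-character ASIN out of an Amazon item URL.
// Assumes the ASIN follows /dp/, /gp/product/ or /exec/obidos/ASIN/.
function extractAsin(url) {
  var re = /\/(?:dp|gp\/product|exec\/obidos\/ASIN)\/([A-Z0-9]{10})(?:[\/?#]|$)/i;
  var match = re.exec(url);
  return match ? match[1].toUpperCase() : null;
}

// Hypothetical scrape step: find the ASIN, ask `lookup` for the metadata
// (a stand-in for the ECS web service call), and return a metadata object.
function scrape(url, lookup) {
  var asin = extractAsin(url);
  if (asin === null) {
    return null; // not an item page we recognise
  }
  var item = lookup(asin);
  item.ASIN = asin;
  return item;
}
```

Called with a stub such as `function (asin) { return { itemType: "book" }; }`, `scrape` returns that object with the `ASIN` field filled in, which is roughly the shape a bibliographic manager would then consume.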

The point of this is to try to make JavaScript scrapers that will run in Firefox (for Zotero), WebKit (for BibDesk and Papers) and Rhino (server-side, for Connotea, CiteULike, Bibsonomy, etc.).