Fetching Web Resources


Imagine someone new to writing code for the web. They’ve read Tim Berners-Lee’s books, and understand that there are Resources out there, with URLs that can be used to fetch them. What code do they need to write to fetch and use those Resources?

  1. Use jQuery.ajax.

    Question: What’s “ajax”…?

    Answer: It’s an acronym (AJAX). It stands for

    Asynchronous (fair enough)

    JavaScript (ok)

    And (er…)

    XML (oh.)

  2. Use XMLHttpRequest.

    Hmm…

If you’re working with JSON or HTML (which is probably the case), these interface names make no sense. And that’s before you get into the jQuery.ajax option names (data for the query parameters, dataType for the response type, etc.).

As is apparently the way with all DOM APIs, XMLHttpRequest wasn’t designed to be used directly. It also doesn’t return a Promise; instead, you assign an onload handler that is called when the request completes. Additionally, query strings are treated as plain strings, when they’re actually a serialisation of a set of key/value pairs.
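To illustrate the point about query strings, here is a minimal sketch that treats them as key/value pairs, using the standard URL and URLSearchParams interfaces. The helper name buildUrl and the example.com address are made up for this example, and turning arrays into repeated keys is only one convention among several.

```javascript
// buildUrl is a hypothetical helper, not part of any library.
// It takes a base address and a plain object of query parameters.
function buildUrl(base, params) {
  const url = new URL(base);
  for (const [key, value] of Object.entries(params || {})) {
    // Assumption: arrays become repeated keys (?tag=a&tag=b).
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      // append() handles the escaping, so callers never build the string by hand.
      url.searchParams.append(key, String(v));
    }
  }
  return url.toString();
}

const searchUrl = buildUrl('https://api.example.com/search', {
  q: 'miles davis',
  type: 'artist',
});
// → https://api.example.com/search?q=miles+davis&type=artist
```

The caller only ever deals with the object of parameters; the serialisation is left to the platform.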

The Fetch API is an attempt to improve this situation, but it’s still quite unwieldy (being a low-level interface):

fetch(url).then(function(response) {
    return response.json();
}).then(function(data) {
    // do something with the data
});
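In practice the snippet above usually grows a little. fetch only rejects when the request itself fails (a network error, say), so an HTTP error status has to be checked by hand, and the Accept header has to be set explicitly. A rough sketch, where fetchJson and its optional fetchImpl argument are names invented for this example:

```javascript
// fetchJson is a hypothetical helper. fetchImpl is injectable so the function
// can be tried out without a network; by default it calls the global fetch.
function fetchJson(url, fetchImpl = (input, init) => fetch(input, init)) {
  return fetchImpl(url, { headers: { Accept: 'application/json' } })
    .then(function (response) {
      // A 404 or 500 still resolves the promise, so check the status explicitly.
      if (!response.ok) {
        throw new Error('Request failed with status ' + response.status);
      }
      return response.json();
    });
}
```

That is three separate concerns (headers, status, parsing) for what is conceptually one action: get this resource as JSON.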

What’s really going on, and what should the interface look like?

  1. There’s a Resource on the web, with a URL:

    var resource = new Resource(url)
  2. The URL may have query parameters (filters, essentially):

    var resource = new Resource(url, params)
  3. When an action (get, put, delete) is performed on a Resource, a Request is made to the URL of the resource. This is usually an HTTP request.

    resource.get()
  4. The Resource is available in multiple formats:

    resource.get('json')

    (sets the Accept header to ‘application/json’, and parses the response as JSON)

    resource.get('html')
    (sets the Accept header to ‘text/html’, and parses the response as HTML)
  5. The Resource may be contained in a data wrapper. If the response is JSON, HTML or XML, the browser will parse it into the appropriate data object or DOM document:

    return resource.get('json').then(function(data) {
      return data.item;
    });

    return resource.get('html').then(function(doc) {
      return doc.querySelector('#item');
    });
    
  6. Allow the selector(s) for extraction to be specified declaratively, avoiding the use of querySelector directly:

    return resource.get('html', {
      select: '#item',
    });
    
  7. To reduce the amount of code, allow a new instance of a Resource object to be created in a single line:

    return Resource(url).get('json').then(function(data) {
       return data.item;
    });
    
  8. A Collection is a paginated Resource. Given a page of items, it needs to know how to find a) each item in the page and b) the URL of the next/previous page, if there is one:

    Collection(url).get('json', {
      // select the array of items
      items: function(data) {
        return data.artists.items;
      },
      // select the URL of the next chunk
      next: function(data) {
        return data.artists.next;
      }
    }).then(function(items) {
      // do something with the items
    });
    
  9. Instead of sending hundreds of requests to the same domain at once, send them one at a time: each Request is added to a per-domain Queue. When one request finishes, the next request in the queue is sent.
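The points above could be sketched roughly as follows. This is an illustration under assumptions, not the code of web-resource or of any other library: enqueue, FORMATS, collect and the injectable fetchImpl argument are names made up for this example. The queue is keyed by host (a simplification of "per-domain"), collect stands in for the Collection interface, and the select option from point 6 is left out.

```javascript
// Illustrative sketch only; every name here is hypothetical.

// One promise chain per host: a request starts only once the previous one has settled.
const queues = new Map();

function enqueue(host, task) {
  const previous = queues.get(host) || Promise.resolve();
  const result = previous.then(() => task());
  // Store a copy that never rejects, so one failed request doesn't stall the queue.
  queues.set(host, result.catch(() => {}));
  return result;
}

// Each format pairs an Accept header with a way of parsing the response body.
const FORMATS = {
  json: { accept: 'application/json', parse: (response) => response.json() },
  html: {
    accept: 'text/html',
    // DOMParser exists in browsers only.
    parse: (response) =>
      response.text().then((text) => new DOMParser().parseFromString(text, 'text/html')),
  },
};

class Resource {
  // fetchImpl is injectable so the class can be tried out without a network.
  constructor(url, params = {}, fetchImpl = (input, init) => fetch(input, init)) {
    this.url = new URL(url);
    for (const [key, value] of Object.entries(params)) {
      this.url.searchParams.set(key, String(value));
    }
    this.fetchImpl = fetchImpl;
  }

  get(format = 'json') {
    const { accept, parse } = FORMATS[format];
    return enqueue(this.url.host, () =>
      this.fetchImpl(this.url.toString(), { headers: { Accept: accept } }).then((response) => {
        if (!response.ok) throw new Error('Request failed with status ' + response.status);
        return parse(response);
      })
    );
  }
}

// Follow "next" links until there are none, gathering the items from every page.
async function collect(resource, format, { items, next }) {
  const results = [];
  let current = resource;
  while (current) {
    const data = await current.get(format);
    results.push(...items(data));
    const nextUrl = next(data);
    current = nextUrl ? new Resource(nextUrl, {}, resource.fetchImpl) : null;
  }
  return results;
}
```

With this in place, the Collection example from point 8 would be written as collect(new Resource(url), 'json', { items, next }), with the same two selector functions. Because every get goes through enqueue, fetching page after page from one host never has more than one request in flight.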

Implementations

x-ray is a really nice implementation of a scraper for extracting collections of data from HTML web pages. It doesn’t extend to other data formats, though.

web-resource is my JavaScript library that implements the Resource and Collection interfaces described above.