Using Google Fusion Tables to provide an API to data files

This example imports a large TSV file (all the data from the CUFTS Knowledgebase) into a Google Fusion Table, then uses the Fusion Table API to provide a JSON API for querying.

Here's an example table, and an example interface.

  1. Download the latest CUFTS knowledgebase data file:
    wget --continue 'http://cufts2.lib.sfu.ca/knowledgebase/CUFTS_complete_20120801.tgz' --output-document='cufts.tar.gz'
  2. Extract the data and remove any unwanted files:
    mkdir cufts
    tar -xvzf cufts.tar.gz --directory cufts
    rm cufts/update.xml
  3. Convert the TSV files to UTF-8 encoding:
    find cufts/ -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o "{}-utf8.tsv" \;
  4. To merge multiple files together, create a new project in Google Refine and add all the UTF-8 TSV files, then export the data from Google Refine to a single TSV file. If this exported file is over 100MB, split it into smaller pieces:
    tail -n +2 cufts.tsv | split -l 500000 - cufts-split-
    for file in cufts-split-*; do
      head -n 1 cufts.tsv > tmp_file
      cat $file >> tmp_file
      mv -f tmp_file $file
    done

    Alternatively, add each file to the Fusion Table individually, either manually or using the API.

  5. In Google Drive choose Create > Table and import the TSV file to create a new Fusion Table. To add more data, choose File > Import more rows.
  6. Edit the Fusion Table's sharing settings to make it public.
  7. Create a new project in Google APIs Console, enable Fusion Tables API in the "Services" section, and copy the API key from the "API Access" section, for use when querying.
  8. Clone this GitHub Pages repository, and use it to create your own interface to the Fusion Table data. Basically the queries are SQL, e.g.
    SELECT file, title FROM abcdefghijk WHERE issn = '1234-567X'
    and querying in jQuery is simple (the responses have a CORS header so can be accessed cross-domain):
    $.ajax({
      url: "https://www.googleapis.com/fusiontables/v1/query",
      data: {
        key: "YOUR API KEY",
        sql: "YOUR SQL QUERY"
      }
    });

I have noticed that queries can be (variably) quite slow on large tables like this one, but hopefully that's temporary…