Querying Data Sets using Google BigQuery

Google BigQuery is a "cloud SQL service built on Dremel". It can quickly run SQL queries across large data sets.

Example

Say you've downloaded a relatively small set of TSV files (around 100MB), unpacked them locally, and converted them to UTF-8:

wget --continue 'http://cufts2.lib.sfu.ca/knowledgebase/CUFTS_complete_20120801.tgz' --output-document='knowledgebase.tar.gz'
mkdir knowledgebase
tar -xvzf knowledgebase.tar.gz --directory knowledgebase
rm knowledgebase/update.xml
find knowledgebase/ -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o "{}.tsv" \;

Create a new project in the Google API Console, enable the BigQuery API service for that project (this requires enabling billing), and install bq, the command-line interface to the BigQuery API:

sudo easy_install bigquery
bq init
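
As a quick check that bq is installed and authorised (assuming the project you just created is set as your default project), list the datasets it can see:

bq ls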

Create a new dataset in the BigQuery browser, and set it as your default dataset:

echo dataset_id=cufts >> ~/.bigqueryrc
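
If you'd rather not use the browser for this step, the same dataset can be created from the command line (a sketch, using the dataset name from above):

bq mk cufts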

Multiple tables with different schemas

Load each of the TSV files into a table in your BigQuery dataset:

for FILE in knowledgebase/*.tsv; do
    echo "$FILE"
    # name each table after the file it came from
    BASENAME=$(basename "$FILE" .tsv)
    echo "$BASENAME"
    # build the schema (a comma-separated list of field names) from the header row
    SCHEMA=$(head --lines=1 "$FILE" | sed -e "s/\t/, /g")
    echo "$SCHEMA"
    bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 "$BASENAME" "$FILE" "$SCHEMA"
done
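
To confirm the loads worked, list the tables in the dataset and inspect one of the generated schemas (the table name here assumes the knowledgebase directory contains a file called aaas.tsv):

bq ls cufts
bq show cufts.aaas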

Now they can be queried in the BigQuery browser:

SELECT title FROM cufts.aaas, cufts.acs, cufts.doaj WHERE issn = '17605776'

The trouble is that a) you have to specify each table individually in the query, and b) each table has a different schema, so any query that references a field not present in every table fails with an error.
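
For example, assuming e_issn is a column in only some of these tables (an assumption about the CUFTS schemas, purely for illustration), a query that references it fails with an error rather than returning rows:

bq query "SELECT title, e_issn FROM cufts.aaas, cufts.acs, cufts.doaj WHERE issn = '17605776'"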

A single table with normalised schema

To combine all the tables into one normalised file, create a new project in Google Refine, add all the TSV files, then export the data to a single TSV file. As this file is quite big, store it in Google Cloud Storage so it's available for re-use later.

In the Google API Console, enable the Google Cloud Storage service for your project. Create a new bucket (called "cufts" in this example). Install gsutil, the Cloud Storage command-line interface, and upload the combined TSV file (the "-z" option applies gzip compression):

gsutil cp -z tsv -a public-read cufts.tsv gs://cufts/
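
You can confirm the upload (and see the stored object's size) with a listing:

gsutil ls -l gs://cufts/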

Once it's been uploaded to Cloud Storage, import the file into a new BigQuery table:

SCHEMA=$(head --lines=1 cufts.tsv | sed -e "s/\t/, /g")
echo $SCHEMA
bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 knowledgebase "gs://cufts/cufts.tsv" "$SCHEMA"

Now you can run queries against the combined data set:

SELECT file, title, e_issn FROM cufts.knowledgebase WHERE issn = '01617761' OR e_issn = '01617761'
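
The same query can also be run from the command line with bq, which is handy for scripting (the --format flag switches the output format, e.g. to csv or json):

bq query "SELECT file, title, e_issn FROM cufts.knowledgebase WHERE issn = '01617761' OR e_issn = '01617761'"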

Queries can take a few seconds to run but, as long as they return results before timing out, you can use JavaScript to access the query API.

These queries require OAuth 2.0 authentication and a project ID, as each query counts towards the quota and billing of the project that runs it. That makes it impossible to open the dataset up to public queries this way; you'd have to provide an API yourself and handle authentication on the server.
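
For reference, here's a rough sketch of such a query-API request, using curl rather than JavaScript; it assumes you already have a valid OAuth 2.0 access token in ACCESS_TOKEN, your project ID in PROJECT_ID, and that the synchronous jobs.query method of the BigQuery v2 REST API is used:

curl -s \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT title, issn FROM cufts.knowledgebase LIMIT 10"}' \
  "https://www.googleapis.com/bigquery/v2/projects/$PROJECT_ID/queries"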