Google BigQuery is a "cloud SQL service built on Dremel". It can quickly run SQL queries across large data sets.
Example
Say you've downloaded a relatively small set of TSV files (around 100MB), unpacked them locally, and converted them to UTF-8:
wget --continue 'http://cufts2.lib.sfu.ca/knowledgebase/CUFTS_complete_20120801.tgz' --output-document='knowledgebase.tar.gz'
mkdir knowledgebase
tar -xvzf knowledgebase.tar.gz --directory knowledgebase
rm knowledgebase/update.xml
find knowledgebase/ -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o "{}.tsv" \;
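To make sure the conversion worked, the file utility can report each file's apparent character encoding (just a sanity check, not a required step):
file --mime knowledgebase/*.tsv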
Create a new project in Google API Console, enable the BigQuery API service for that project (requires enabling billing), and install bq (command line interface to the BigQuery API):
sudo easy_install bigquery
bq init
Create a new dataset in the BigQuery browser, and set it as your default dataset:
echo dataset_id=cufts >> ~/.bigqueryrc
Multiple tables with different schemas
Upload all the TSV files to your BigQuery dataset:
for FILE in knowledgebase/*.tsv; do
  # the table name is the file name without its extension
  BASENAME=$(basename "$FILE" .tsv)
  # build the schema from the header row: tab-separated field names become a comma-separated list (each field defaults to STRING)
  SCHEMA=$(head --lines=1 "$FILE" | sed -e "s/\t/, /g")
  echo "$FILE -> $BASENAME: $SCHEMA"
  bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 "$BASENAME" "$FILE" "$SCHEMA"
done
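If the loads succeed, each TSV file should now be a table in the cufts dataset; bq can list them and show an individual table's schema:
bq ls cufts
bq show cufts.aaas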
Now they can be queried in the BigQuery browser:
SELECT title FROM cufts.aaas, cufts.acs, cufts.doaj WHERE issn = '17605776'
The trouble is that a) you have to specify each table individually in the query, and b) each table has a different schema, so you get an error querying for any field that isn't present in all tables.
A single table with normalised schema
To combine all the tables into one normalised file, create a new project in Google Refine, add all the TSV files, then export the data to a single TSV file. As this file is quite big, store it in Google Cloud Storage so it's available for re-use later:
In the Google API Console, enable the Google Cloud Storage service for your project and create a new bucket (called "cufts" in this example). Install gsutil, the Cloud Storage command line interface, and upload the combined TSV file (using the "-z" option to enable gzip compression):
gsutil cp -z tsv -a public-read cufts.tsv gs://cufts/
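Once the copy has finished, you can check that the object is in the bucket (and see its size):
gsutil ls -l gs://cufts/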
Once it's been uploaded to Cloud Storage, import the file to a new BigQuery table:
SCHEMA=$(head --lines=1 cufts.tsv | sed -e "s/\t/, /g")
echo "$SCHEMA"
bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 knowledgebase "gs://cufts/cufts.tsv" "$SCHEMA"
Now you can run queries against the combined data set:
SELECT file, title, e_issn FROM cufts.knowledgebase WHERE issn = '01617761' OR e_issn = '01617761'
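The same query can also be run from the command line, which is handy for scripting:
bq query "SELECT file, title, e_issn FROM cufts.knowledgebase WHERE issn = '01617761' OR e_issn = '01617761'"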
Queries can take a few seconds to run but, as long as they return results before timing out, you can use JavaScript to access the query API.
These queries require OAuth 2.0 authentication and a project ID, as each query counts towards the quota and billing of the project that runs it. That means it isn't possible to open the dataset up to public queries this way; you'd have to provide an API yourself and handle authentication on the server.
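For reference, a query sent directly to the BigQuery REST API looks roughly like this; ACCESS_TOKEN and PROJECT_ID are placeholders for an OAuth 2.0 access token and your project's ID, which you'd obtain through the usual OAuth flow and the API Console:
curl -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT title FROM cufts.knowledgebase WHERE issn = \"01617761\""}' \
  "https://www.googleapis.com/bigquery/v2/projects/$PROJECT_ID/queries"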