"git scraping" data from the Office for National Statistics API

The Office for National Statistics (ONS) publishes a regularly updated dataset, “Deaths registered weekly in England and Wales, provisional”, as an Excel file each week, plus a summary Excel file per year.

There is also an API which provides the same data as CSV.

To make this data accessible to web applications, I’m going to combine the historical data (assembled manually) in a git repository with the latest dataset from the ONS API, using a scheduled GitHub Action to keep it up to date - the “git scraping” pattern described by Simon Willison.

  1. In GitHub, create a new repository.
  2. Add a metadata.yml file describing the data source and license (a sketch follows this list).
  3. Add a requirements.txt file containing datasette and sqlite-utils.
  4. Create a "fetch" workflow with a schedule trigger that uses cURL to fetch the latest version of the dataset to a file. If the data has been updated when the scheduled workflow runs, it will be committed and pushed back to the repository. Note: use at least ubuntu-20.04 to get a version of curl that supports the --no-progress-meter option.
  5. Create a "deploy" workflow which uses sqlite-utils to insert the data from the CSV file(s) into a SQLite database, then uses datasette to publish the SQLite database as a web service with Cloud Run.
  6. Optionally, add a CNAME record for a subdomain and map the subdomain to the deployed service in Cloud Run.
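
A minimal metadata.yml for step 2 might look something like this. The title and URLs are placeholders to check against the dataset’s own landing page, though ONS data is generally published under the Open Government Licence v3.0:

```yaml
# Sketch of a datasette metadata.yml - title and URLs are placeholders
title: Deaths registered weekly in England and Wales, provisional
source: Office for National Statistics
source_url: https://www.ons.gov.uk/  # replace with the dataset's landing page
license: Open Government Licence v3.0
license_url: https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
```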
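
The fetch workflow for step 4 follows the usual git scraping shape: check out the repository, download the CSV, and commit only if something changed. This is a sketch - the cron schedule, the output path and, in particular, the ONS API URL are placeholders to replace with the dataset’s real CSV download link:

```yaml
# .github/workflows/fetch.yml - sketch of the scheduled "fetch" workflow
name: Fetch latest ONS data

on:
  workflow_dispatch:
  schedule:
    - cron: '0 7 * * *'  # placeholder: once a day at 07:00 UTC

jobs:
  fetch:
    runs-on: ubuntu-20.04  # curl on 20.04 supports --no-progress-meter
    steps:
      - uses: actions/checkout@v3
      - name: Fetch the latest CSV from the ONS API
        run: |
          # Placeholder URL - use the dataset's CSV download link here
          curl --no-progress-meter --output data/latest.csv \
            'https://api.beta.ons.gov.uk/v1/...'
      - name: Commit and push if the data has changed
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add data/latest.csv
          git commit -m "Latest data: $(date -u)" || exit 0
          git push
```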
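
The deploy workflow for step 5 is sketched below under a few assumptions: the database, table and service names are made up, the GCP_SA_KEY and GCP_PROJECT secret names are my own convention, and authentication uses the google-github-actions auth and setup-gcloud actions before running datasette publish cloudrun:

```yaml
# .github/workflows/deploy.yml - sketch of the "deploy" workflow
name: Deploy to Cloud Run

on:
  workflow_dispatch:
  push:
    branches: [main]  # assumption: redeploy when new data lands on main

jobs:
  deploy:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install datasette and sqlite-utils
        run: pip install -r requirements.txt
      - name: Build the SQLite database from the CSV file(s)
        run: |
          # Database and table names are placeholders
          sqlite-utils insert deaths.db weekly_deaths data/latest.csv --csv
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}  # the JSON key stored as a repository secret
      - uses: google-github-actions/setup-gcloud@v1
      - name: Publish with datasette
        run: |
          # Assumption: the Google Cloud project id is also stored as a secret
          gcloud config set project "${{ secrets.GCP_PROJECT }}"
          datasette publish cloudrun deaths.db \
            --service=ons-weekly-deaths \
            --metadata=metadata.yml
```

datasette publish cloudrun builds a container image with Cloud Build and deploys it to Cloud Run in the currently selected project, which is why both APIs need to be enabled as described below.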

For Cloud Run, you need to create a Google Cloud project, enable the Cloud Build and Cloud Run APIs, create a service account and download a JSON private key for it, add the key to the repository’s secrets, and allow unauthenticated access to the deployed service.

Here's the git repository containing the workflows and data, and the resulting datasette-powered web service.