CSV on the Web, with PHP

The W3C has launched a new "CSV on the Web" Working Group, part of the W3C Data Activity. I thought this would be a good time to look at how I've been using CSV files recently.

Data analysis projects these days often seem to involve the same steps:

  1. Fetch documents (usually HTML, JSON or XML) from a remote web site.
  2. Extract data from those documents, and save the extracted data as rows in a CSV file.
  3. For each row in the CSV file, fetch more data from a different remote web site.
  4. Parse that data and save the extracted data as rows in another CSV file.
  5. Publish the data.

Here's a bit more information on the steps involved:

Fetching with cURL

To fetch data, PHP's curl_init, curl_setopt, curl_exec and curl_getinfo functions all work nicely. I find it useful to wrap them in a CurlClient class, with subclasses for any non-standard web services that need special handling: this makes it easy to set the basic options, handle rate limiting, and re-use the same cURL handle, keeping connections open between requests.
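As a rough sketch, a minimal version of such a class might look like this (the option choices and the error handling are assumptions, not a complete implementation):

class CurlClient {
  protected $curl;

  public function __construct() {
    // one handle for the lifetime of the client, so connections stay open
    $this->curl = curl_init();
    curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($this->curl, CURLOPT_ENCODING, 'deflate,gzip');
  }

  public function get($url) {
    curl_setopt($this->curl, CURLOPT_URL, $url);
    $response = curl_exec($this->curl);

    if (curl_getinfo($this->curl, CURLINFO_HTTP_CODE) !== 200) {
      // handle the error, back off if rate-limited, etc.
    }

    return $response;
  }

  public function __destruct() {
    curl_close($this->curl);
  }
}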

Setting CURLOPT_ENCODING to 'deflate,gzip' allows the server to send compressed data, and a combination of gzopen, CURLOPT_FILE and gzclose makes cURL write the response straight to a compressed file, without the whole response having to be held in memory.
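For example, something along these lines should stream a response straight into a gzip file (the URL and file name here are placeholders):

$file = gzopen('response.html.gz', 'w'); // open the compressed output file

$curl = curl_init('http://example.org/page.html'); // placeholder URL
curl_setopt($curl, CURLOPT_ENCODING, 'deflate,gzip'); // ask for a compressed transfer
curl_setopt($curl, CURLOPT_FILE, $file); // write the response body straight to the gzip stream
curl_exec($curl);
curl_close($curl);

gzclose($file);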

Parsing the response

If the data is JSON, use json_decode to turn it into an object or, with the second argument set to true, an associative array. If it's HTML or XML, use DOMDocument and DOMXPath to extract the data. DOMXPath::evaluate('string(…)') is particularly useful for queries that should return a single piece of text.
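As a rough sketch (the $json and $html variables and the XPath expressions are only examples):

$data = json_decode($json, true); // true to get associative arrays

$doc = new DOMDocument;
@$doc->loadHTML($html); // suppress warnings from real-world HTML

$xpath = new DOMXPath($doc);

// a single piece of text
$title = $xpath->evaluate('string(//title)');

// a node list, for repeated elements
foreach ($xpath->query('//li[@class="item"]') as $node) {
  $name = $xpath->evaluate('string(a)', $node); // evaluate relative to each node
}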

Storage

CSV is an obvious choice for storage, as it's straightforward to use for objects with a single, flat layer of properties and no nested objects.

fputcsv is easy enough to use; fgetcsv is a bit trickier. The general procedure is something like this:

$input = gzopen('input.csv.gz', 'r'); // open the compressed input file
$fields = fgetcsv($input); // read column headers
$field_count = count($fields); // count the number of fields

$output = gzopen('output.csv.gz', 'w'); // open the compressed output file
fputcsv($output, array('id', 'name')); // write column headers

// read each line of the input
while (($row = fgetcsv($input)) !== false) {
  $row = array_pad($row, $field_count, null); // make sure enough columns are present
  $item = array_combine($fields, $row); // map each column to an associative array

  // do something with $item

  // build the data to save
  // use field names to help keep the header row in sync
  $data = array(
    'id' => 'foo',
    'name' => 'bar',
  );

  fputcsv($output, $data);
}

gzclose($input); // close the input file
gzclose($output); // close the output file

It might be easier to use SplFileObject with the READ_CSV flag, instead of the while loop above:

$input = new SplFileObject('compress.zlib://input.csv.gz');
// skip blank lines, so a trailing newline doesn't produce an empty row
$input->setFlags(SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY | SplFileObject::DROP_NEW_LINE);

foreach ($input as $i => $row) {
  if ($i === 0) {
    $fields = $row; // read column headers
    $field_count = count($fields); // count the number of fields
    continue;
  }

  $row = array_pad($row, $field_count, null); // make sure enough columns are present
  $item = array_combine($fields, $row); // map each column to an associative array

  // handle each row
}

In fact, you can define a function that opens and reads the CSV file, and pass it a callback to use for each row:

$counts = array();

read_csv('input.csv', function($item) use (&$counts) {
  $name = $item['name'];

  if (!isset($counts[$name])) {
    $counts[$name] = 0;
  }

  $counts[$name]++;
});

// sort the data
arsort($counts);

// open the output file
$output = fopen('output.csv', 'w');

// write column headers
fputcsv($output, array('name', 'count'));

foreach ($counts as $name => $count) {
  fputcsv($output, array($name, $count));
}

fclose($output);

function read_csv($file, $callback, $header = true) {
  // use 'compress.zlib://…' for a compressed file
  $input = new SplFileObject($file);
  // skip blank lines, so a trailing newline doesn't produce an empty row
  $input->setFlags(SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY | SplFileObject::DROP_NEW_LINE);

  foreach ($input as $i => $row) {
    if ($header) {
      // read the header row
      if ($i === 0) {
        // read column headers
        $fields = $row;
        // count the number of fields
        $field_count = count($fields);
        continue;
      }

      // make sure enough columns are present
      $row = array_pad($row, $field_count, null);

      // map each column to an associative array
      $row = array_combine($fields, $row);
    }

    call_user_func($callback, $row);
  }
}

Publishing

There are several ways to publish a CSV file online:

  1. Simply putting the CSV file online is often enough. It's a standard format and people can easily make use of it.
  2. Import the file to a Google Spreadsheet and publish it. From there, people can use libraries like Tabletop.js and sheetsee.js to work with the data, or can download the raw data. The limits on spreadsheet size have recently been raised, so it should be able to handle fairly large files.
  3. Import the file into a Google Fusion Table and publish it. Suitable for large, static files.
  4. Import the CSV file to dat, which will start a dat server and allow people to pull updates.

Describing

Although the column headers of a CSV file are often enough to describe the data fairly well, there are several extra pieces that it's helpful to provide:

  • A README file that lists the data sources, and describes how the data was generated.
  • The code that was used to fetch, process and output the data, with instructions for how to run it to produce the same output.
  • A LICENSE file that describes the license of the data, ideally one of the Open Data Commons licenses.
  • A JSON file accompanying the CSV file, containing metadata. Ideally, this will define the data type of each field. Simple Data Format (SDF) is one possible format for this datapackage.json metadata file; a sketch of generating one follows this list. Perhaps this file could even contain a mapping of each property to a URL, analogous to a JSON-LD context document (CSV-LD?).
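As a sketch of what generating such a file from PHP might look like (the package name and field types are illustrative assumptions, following the datapackage.json conventions; the path and fields match the output.csv example above):

// a minimal sketch: write a datapackage.json describing the CSV file
$package = array(
  'name' => 'example-data', // assumed package name
  'resources' => array(
    array(
      'path' => 'output.csv',
      'schema' => array(
        'fields' => array(
          array('name' => 'name', 'type' => 'string'),
          array('name' => 'count', 'type' => 'integer'),
        ),
      ),
    ),
  ),
);

file_put_contents('datapackage.json', json_encode($package, JSON_PRETTY_PRINT));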

The Open Knowledge Foundation have a useful page describing CSV and related formats.