Searching for mergeable tables

·

Among CartoDB’s many useful features is the ability to merge tables together, via an interface which lets you choose which column from each to use as the shared key, and which columns to import to the final merged table.

Google's Fusion Tables similarly encourages merging of tabular data. Fusion Tables creates a virtual merged table, allowing updates to the source tables to be replicated to the final merged table as they occur.

CartoDB can also merge tables using location columns, counting items from one table (with latitude and longitude, or addresses) that are positioned within the areas defined in another table (with polygons).

I've found that UK parliamentary constituencies are useful for visualising data, as they have a similar population number in each constituency and they have at least two identifiers in published ontologies which can be used to merge data from other sources*. The UK parliamentary constituency shapefiles published by the Ordnance Survey as part of the Boundary-Line dataset contain polygons, names and two identifiers for each area: one is the Ordnance Survey’s own “unit id” and one is the Office for National Statistics’ “GSS code”.

Once the parliamentary constituency shapefile has been imported to a base table, any CSV table that contains either of those identifiers can easily be merged with the base table to create a new, merged table and associated visualisation.

So, the task is to find other data sets that contain either the OS “unit id” or the ONS “GSS code”.

The URLs for the data types of these codes are defined in the Ordnance Survey’s “administrative geography and civil voting area” ontology:

The values themselves can also be expressed as URLs:

GSSE14000929
GSS URLhttp://statistics.data.gov.uk/doc/statistical-geography/E14000929
Unit ID24896
Unit ID URLhttp://data.ordnancesurvey.co.uk/id/7000000000024896

However, unlike the Linked Data/SPARQL interfaces, most CSV or Excel files that are currently published (such as those produced by the Office for National Statistics as a result of census analysis) don’t define the data type of each column using URLs. Although there’s usually a property name in the first row, there’s rarely a datapackage.json file defining a basic data type (number, string, date, etc), and practically never a JSON-LD context file to map those names to URLs.

Given an index of CSV files, like those in CKAN-based stores such as data.gov.uk, how can we identify those which contain either unit IDs or GSS codes?

As Thomas Levine's commasearch project demonstrated at csvconf last year, if you have a list of all (or even just some) of the known members of a collection of typed entities (e.g. a list of all the countries in the world), it’s easy enough to find other datasets that contain them: as long as at least a certain proportion of the distinct values of a column match those in the known collection, the data type can be guessed, and can be assigned a URL.


TODO: compile lists of values for known data types, particularly from Wikidata. For example: country names (a list of names that changes slowly), members of parliament (a list of names that changes regularly), years (a range of numbers that grows gradually), gene identifiers (a list of strings that grows over time), postcodes (a list of known values, or values matching a regular expression).

Related tools

Footnotes

* On the downside, parliamentary constituencies can be changed (currently every 5 years), as population density shifts around the country and the current government decides to increase or decrease the number of constituencies. This makes it difficult to use the constituencies for long-term comparisons.