2006-06-28 Data Webs Conference

An interesting (and well-organised) Data Webs conference was held yesterday at Imperial College, London. The main theme was that scientists need to publish more of their data and, as this data will be in distributed databases (including those previously known as journals), there is a need for effective mechanisms for querying and harvesting metadata from those databases. The solutions appear to be (probably) SPARQL, OAI-PMH and central ping servers/metadata registries, or something along those lines.
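To make the harvesting idea concrete, here is a minimal sketch of how a central registry might pull Dublin Core metadata from one database over OAI-PMH. The base URL, the sample response and the function names are my own hypothetical placeholders, not anything shown at the conference; only the `ListRecords` verb, the `oai_dc` prefix and the XML namespaces come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("{%s}record" % OAI_NS):
        identifier = record.findtext(
            "{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        titles = [t.text for t in record.iter("{%s}title" % DC_NS)]
        records.append({"identifier": identifier, "titles": titles})
    # An absent or empty token means the list is complete.
    token = root.findtext(".//{%s}resumptionToken" % OAI_NS)
    return records, (token or None)


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:image-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example image record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("http://example.org/oai")
records, token = parse_records(SAMPLE_RESPONSE)
print(url)
print(records, token)
```

A real harvester would fetch each URL over HTTP and keep requesting with the returned resumption token until none comes back; that loop is left out here so the sketch runs without a network.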

The presentations should all be online soon, but if anyone is interested, here are my brief notes (my own comments are in square brackets).

  1. David Shotton
    • Lightweight semantic web
    • Need to publish all research data
    • Need descriptive metadata
    • 'Universal truth' data (collected once and made public): EMBL, UniProt, PDB, genome databases
    • 'Particular' data: distributed individual databases
    • 'dataspace'
    • central 'data marshalling' service - harvests and stores metadata from distributed databases
  2. Nicholas Gibbins
  3. Andy Seaborne
    • SPARQL: query language, protocol, XML results format (HTTP + SOAP)
    • Use SPARQL to query original data stores and end up with RDF for distribution
    • SQL mapping: SquirrelRDF, D2RQ
    • Need to provide ontologies to match SQL columns to the outside world and describe the information
    • Federated query: service description, information directory
    • Jena framework does RDF, SPARQL, OWL, rules
    • Cost-based query optimisation - can refuse difficult requests
    • Lots of databases implementing SPARQL (see W3C site)
  4. David Karger
    • Ontologies, schema annotations, rules, inference systems
    • OR
    • 'semimantic web': items with URLs, joined by named relations, but no semantics
    • Applications currently built one-size-fits-all
    • Piggy Bank/Longwell/lenses/Fresnel/Haystack
    • Don't need to agree on ontologies - just do it!
    • RDF, eg scientist's list of publications on home page
    • [didn't mention microformats]
  5. John Helliwell
    • checkcif.iucr.org
    • structured data as supplement to published paper
    • not freely distributable (copyright retained by the publisher)
  6. Philip Bourne
    • BioLit toolkit
    • PLoS open data (requires attribution still)
    • Topaz = manuscript and content management system - also to be open source
    • PDB: DOI for every structure
    • Previous authoring tool prototype: BioEditor
    • Making an authoring plugin for MS Word [so are the NCBI]
    • mbt.sdsc.edu (display content)
  7. Anita de Waard
    • Structuring papers to allow linking between arguments and the data that they're based on [needs URIs + fragment identifiers]
    • Storytelling/computer-assisted understanding of causality/relations between statements/proof
  8. Ben Lund
    • Connotea
    • Autodiscovery of metadata: RSS/Atom feeds, RIS, OTMI, embedded RDF (RDF/XML, eRDF, RDF/A), citation microformat
  9. Peter Mika
    • Flink - semanticweb.org
    • www.openrdf.org (open source)
    • scientometrics
    • Making use of disparate sources of data
    • Need liberal parsers, especially for RDF
    • openacademia
  10. David Shotton
    • OGSA-DAI
    • Data published on individual's own servers
    • Metadata harvested -> central registry
    • BioImageWeb - data web for *published* biological images
    • Supported by publishers
    • Make ontology using extended DC
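
As a rough illustration of the SPARQL notes from talk 3 (a query language, an HTTP protocol and an XML results format), here is a minimal sketch of the client side. The endpoint URL, the query and the sample results are hypothetical and made up for illustration; only the `query` parameter of the SPARQL protocol and the results namespace are taken from the W3C specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the SPARQL Query Results XML Format.
SR_NS = "http://www.w3.org/2005/sparql-results#"

# Hypothetical query: find resources and their Dublin Core titles.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?image ?title
WHERE { ?image dc:title ?title }
LIMIT 10
"""


def query_url(endpoint, query):
    """Build a SPARQL protocol GET request URL for the given endpoint."""
    return endpoint + "?" + urlencode({"query": query})


def parse_results(xml_text):
    """Turn a SPARQL XML results document into a list of dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.iter("{%s}result" % SR_NS):
        row = {}
        for binding in result.findall("{%s}binding" % SR_NS):
            value = binding[0]  # first child: uri, literal or bnode
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows


# Hypothetical response from an endpoint, in the XML results format.
SAMPLE_RESULTS = """
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="image"/>
    <variable name="title"/>
  </head>
  <results>
    <result>
      <binding name="image"><uri>http://example.org/image/1</uri></binding>
      <binding name="title"><literal>Example image</literal></binding>
    </result>
  </results>
</sparql>
"""

url = query_url("http://example.org/sparql", QUERY)
rows = parse_results(SAMPLE_RESULTS)
print(url)
print(rows)
```

The same query could in principle be sent to each distributed database in turn, which is roughly what the federated query idea amounts to; how the service descriptions and directory would work is not something this sketch tries to cover.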