UniProt / RDF / SPARQL

UniProt is a big curated database of proteins, for which all the data is available as RDF.

There are a few demonstration SPARQL interfaces to UniProt data:

  1. OpenLink Virtuoso, at two public endpoints (though both currently seem to be missing UniProt)
  2. Intellidimension's RDF Gateway

There is also a selection of example SPARQL queries, and a separate set of example queries for Bio2RDF.

The trouble is that these demonstration services impose limits on the complexity of queries, which are easily reached.

OpenLink have now made available an EC2 AMI of Virtuoso with the whole Bio2RDF project dataset, which includes UniProt, so you can run queries on your own private instance of Bio2RDF. The instructions basically involve installing a Virtuoso EC2 AMI instance and fetching the Bio2RDF data that's been provided in S3.
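
As a rough sketch of what querying such a private instance might look like from Python, the snippet below builds a SPARQL protocol GET request using only the standard library. The endpoint URL `http://localhost:8890/sparql` is an assumption (Virtuoso's usual default, to be replaced with your own EC2 host), and `build_sparql_request` and `run_query` are hypothetical helper names, not part of any Virtuoso or Bio2RDF tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Virtuoso's usual default endpoint; replace with your instance's host.
ENDPOINT = "http://localhost:8890/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a GET request carrying the query in the standard `query` parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the SPARQL JSON results format rather than the server's default.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the decoded JSON result document."""
    with urllib.request.urlopen(build_sparql_request(query, endpoint)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query("SELECT * WHERE { ?s ?p ?o } LIMIT 1")` against a running instance should be enough to check that the endpoint is reachable before trying anything heavier.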

They're also apparently putting together an AMI for the whole freely-available linked data cloud.

I wanted to do a query like this, to select all the proteins that had been cited in Nature papers:

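#PREFIX uni: <urn:lsid:uniprot.org:ontology:>
PREFIX uni: <http://bio2rdf.org/uniprot:>
PREFIX dc: <http://dublincore.org/2008/01/14/dcelements.rdf#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?protein ?name ?date ?title ?identifier
WHERE {
    # Each protein, its name, and the citations attached to it
    ?protein
        a uni:Protein ;
        uni:name ?name ;
        uni:citation ?citation .
    # Keep only the citations whose journal name is "Nature"
    ?citation
        uni:name "Nature"^^xsd:string ;
        uni:date ?date ;
        uni:title ?title ;
        dc:identifier ?identifier .
}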
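
If the endpoint returns the standard SPARQL JSON results format, the rows from a query like the one above can be flattened into plain dictionaries. This is only a minimal sketch: `rows_from_results` is a hypothetical helper, and the sample document is made-up illustrative data with placeholder URIs, not real UniProt records.

```python
def rows_from_results(results):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # A variable left unbound in a solution is simply absent from the binding.
        rows.append({var: binding[var]["value"] if var in binding else None
                     for var in variables})
    return rows


# Made-up sample in the SPARQL JSON results shape (not real UniProt data).
sample = {
    "head": {"vars": ["protein", "title"]},
    "results": {"bindings": [
        {"protein": {"type": "uri", "value": "http://example.org/protein/1"},
         "title": {"type": "literal", "value": "An example title"}},
        {"protein": {"type": "uri", "value": "http://example.org/protein/2"}},
    ]},
}

for row in rows_from_results(sample):
    print(row["protein"], row["title"])
```

Mapping unbound variables to `None` keeps every row the same shape, which is convenient if the rows are later written out as a table.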