COUNT/DISTINCT queries

This is the kind of query you'd use for faceted browsing, for example counting a journal's articles by year of publication.

MySQL:

SELECT `year`, COUNT(*) as c FROM `articles` WHERE `issn` = '1234-5678' GROUP BY `year` ORDER BY `year` ASC
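
To try the query without a MySQL server, here is a minimal Python sketch that runs the same statement against an in-memory SQLite database. The table layout, the sample rows and the helper name counts_by_year are assumptions for illustration; only the shape of the query comes from the statement above.

```python
import sqlite3

def counts_by_year(conn, issn):
    """Return (year, count) pairs for one journal, oldest year first."""
    query = (
        "SELECT year, COUNT(*) AS c FROM articles "
        "WHERE issn = ? GROUP BY year ORDER BY year ASC"
    )
    return conn.execute(query, (issn,)).fetchall()

# Hypothetical sample data: one (issn, year) row per article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (issn TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("1234-5678", 2007),
        ("1234-5678", 2008),
        ("1234-5678", 2008),
        ("9999-0000", 2008),
    ],
)
facets = counts_by_year(conn, "1234-5678")
```

Each tuple in facets is one facet value and its article count, which is all a faceted-browsing sidebar needs.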

XQuery (MarkLogic):

let $docs := doc()[/article/front/journal-meta/issn="1234-5678"]
for $year in fn:distinct-values($docs/article/front/article-meta/pub-date/year)
  let $count := count($docs[article/front/article-meta/pub-date/year = $year]) 
  order by $year ascending
  return
    <year c="{$count}">{$year}</year>

(Note that this is really slow on large collections, because it filters $docs again for every distinct year. Ideally you'd create a separate namespaced block for the issn and year metadata, and use that to build a lexicon for fast queries.)

SPARQL:

FAIL

Comments

Unless you're dealing with permissions or some other filter, you can replace count with xdmp:estimate for a very large speed increase.

Posted by: Gavin Carothers on August 5, 2008 4:37 PM

I think you are making a generalization about SPARQL that isn't quite accurate relative to XQuery.

Please note that there are SPARQL implementations that include aggregate functions.

For example, DBpedia is based on Virtuoso [1], and you can run the following queries against the DBpedia SPARQL endpoint at http://dbpedia.org/sparql:

Example 1:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT count (*) WHERE { ?x foaf:depiction ?y . }

Example 2:
SELECT count (*) WHERE { ?x <http://dbpedia.org/property/abstract> ?y . }


More..

Links:

1. http://virtuoso.openlinksw.com (commercial edition)
2. http://virtuoso.openlinksw.com/wiki/main/ (open source edition)

Note that your XQuery implementation might be clever enough to discover that this is essentially a GROUP BY style query and re-order the querying such that it only has to scan documents exactly once.
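
As a rough illustration of that single-scan idea, here is a Python sketch (not XQuery, and not how any particular engine implements it) that visits each document once and keeps a running tally per year. The (issn, year) tuples and the function name facet_counts are made up for the example.

```python
from collections import Counter

def facet_counts(docs, issn):
    """Count articles per year for one ISSN in a single pass over docs."""
    tally = Counter()
    for doc_issn, year in docs:
        # Each document is inspected exactly once; no per-year rescan.
        if doc_issn == issn:
            tally[year] += 1
    return sorted(tally.items())

# Hypothetical stand-ins for the issn and year pulled from each article.
docs = [
    ("1234-5678", 2008),
    ("1234-5678", 2007),
    ("9999-0000", 2008),
    ("1234-5678", 2008),
]
result = facet_counts(docs, "1234-5678")
```

The per-year filter in the original XQuery does the equivalent of re-walking docs once for every distinct year, which is where the cost on large collections comes from.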
