SELECT * FROM WEB

Ramanathan V. Guha

Ramanathan V. Guha has been describing objects for a long time. He worked on Cyc's knowledge representation in the 1980s and 1990s; created the Meta Content Framework (MCF) at Apple in 1995; worked with Tim Bray at Netscape to serialise MCF as XML in 1997; worked with Dan Libby to create RSS 0.90 (RDF serialised as XML) in 1999; co-edited the RDF Schema specification with Dan Brickley from 2000 to 2004; built the RDF-based search system TAP with Rob McCool in 2002; has worked at Google on its Custom Search Engine (CSE) since 2005; and chairs, with Dan Brickley, the W3C Web Schemas working group.

Schema.org and Microdata

The HTML5 specification includes Microdata, a simple way to mark up descriptions of objects and their properties in HTML, using two main attributes: itemscope and itemprop.
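
As a rough illustration of how itemscope and itemprop describe an object, the sketch below parses a small, made-up HTML fragment with Python's standard html.parser module. The sample markup, the FlatMicrodataParser class and the extract_item helper are all hypothetical, and the parser only handles a single, non-nested item whose property values are plain text; real Microdata processing has more rules than this.

```python
from html.parser import HTMLParser

# Hypothetical sample page: one schema.org Article described with Microdata.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="name">A Framework for Representing Knowledge</h1>
  <span itemprop="author">Marvin Minsky</span>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects itemprop text from a single, non-nested itemscope."""

    def __init__(self):
        super().__init__()
        self.item = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so only its presence matters.
        if "itemscope" in attrs:
            self.item["type"] = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        # The first non-blank text inside an itemprop element is its value.
        if self._current_prop and data.strip():
            self.item[self._current_prop] = data.strip()
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


def extract_item(html):
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.item


if __name__ == "__main__":
    print(extract_item(SAMPLE_HTML))
```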

In 2011, Guha, Brickley, and others worked with the major search engines to launch schema.org, a single, flat ontology which provides URLs and names for all the classes of objects commonly published on the web, and their properties.

Microdata using the schema.org ontology is now widely deployed in HTML documents on the web, and Google can extract this information from crawled documents, index those resources, cluster the entities it harvests, store them in a quad store (eg Freebase), and build its Knowledge Graph: a set of objects, relationships between those objects, and the names of the objects and their properties in different languages.

Programmable Search Engine

Guha has spent the last 8 years at Google filing patents for a "programmable search engine". This programmable search engine is Google's Custom Search Engine (CSE), which allows anyone to build queries on their own subset of Google's search index, filtering which items should be included and choosing how the results should be displayed.

Structured search

The search index includes any structured data that is associated with the document. For example, Custom Search Engine indexes meta tags in the head of an HTML document. To demonstrate this, I created a custom search engine that allows searching for documents which contain a specific meta tag.

These searches work by adding more:pagemap:metatags-{tag} to the query, eg more:pagemap:metatags-citation_title, which limits the results to only those pages that contain <meta name="citation_title"> tags. I've noticed that it may sometimes show documents which don't contain the tag, when multiple documents containing the same entity have been clustered together, such as a PubMed page for a journal article and the article on the publisher's site.
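
The restriction described above is just a suffix on the keyword query, so it can be generated with a one-line helper. This is a minimal sketch; the function name metatag_query is my own, and only the more:pagemap:metatags-{tag} pattern comes from the text.

```python
def metatag_query(keywords, tag):
    """Append a meta-tag restriction to a keyword query."""
    return f"{keywords} more:pagemap:metatags-{tag}".strip()


if __name__ == "__main__":
    print(metatag_query("graphene", "citation_title"))
```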

Searching for objects

This year, Google made all the schema.org classes and properties accessible to CSE. This means that it's now possible to search the web for objects of any supported type, objects that have certain properties, or objects with specific property values.

The first demonstration I saw of this was the Datasets search engine, which limits keyword searches to "Dataset" objects, and uses a custom template to render the results. I expanded (and simplified) the example to create a custom search engine that searches for keywords within any type of object from the schema.org hierarchy.

These searches work by adding more:pagemap:{class} to the query, eg more:pagemap:tvseries, which limits the results to only those pages that contain "TV Series" objects.

It's also possible to query for specific property values. For example, a query for the music album titled "Bushcraft" by the band "Baptists" uses this syntax: more:pagemap:musicrecording more:pagemap:musicgroup-name:baptists more:pagemap:musicalbum-name:bushcraft.
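
The two patterns above (restrict by class, restrict by property value) can be sketched as small string builders. The helper names class_term and property_term are hypothetical; the lowercasing mirrors the examples in the text rather than any documented requirement.

```python
def class_term(cls):
    """Restrict results to pages containing an object of the given class."""
    return f"more:pagemap:{cls.lower()}"


def property_term(cls, prop, value):
    """Restrict results to objects whose property matches the given value."""
    return f"more:pagemap:{cls.lower()}-{prop.lower()}:{value.lower()}"


if __name__ == "__main__":
    # Rebuilds the "Bushcraft" by "Baptists" query from the text.
    query = " ".join([
        class_term("MusicRecording"),
        property_term("MusicGroup", "name", "Baptists"),
        property_term("MusicAlbum", "name", "Bushcraft"),
    ])
    print(query)
```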

Query by example

What I particularly like about this search interface is that it provides "query by example". An object with partial metadata:

{
    type: "Article",
    author: "Minsky",
    name: "A framework for representing knowledge",
}

can be turned into a structured query:

more:pagemap:article
more:pagemap:article-author:minsky
more:pagemap:article-name:a*framework*for*representing*knowledge

which will return the URLs of pages containing a matching entity, along with the full "rich snippet" set of extracted metadata:

"article": {
    "name": "A Framework for Representing Knowledge",
    "author": "Marvin Minsky",
    "description": "It seems to me that the ingredients of most theories both in Artificial Intelligence and in Psychology have been on the whole too minute, local, and unstructured to account–either practically...",
    "url": "http://web.media.mit.edu/~minsky/papers/Frames/frames.html"
}
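
The conversion from a partial object to a structured query can be sketched in a few lines. The function names are hypothetical, and joining the words of a value with "*" simply copies the example above. The search_url helper assumes the query is sent to Google's Custom Search JSON API with placeholder credentials; that choice of endpoint is my assumption, not something the example above specifies.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def query_by_example(obj):
    """Turn a partial object description into structured query terms."""
    cls = obj["type"].lower()
    terms = [f"more:pagemap:{cls}"]
    for prop, value in obj.items():
        if prop == "type":
            continue
        # Multi-word values are joined with "*", as in the example above.
        encoded = "*".join(str(value).lower().split())
        terms.append(f"more:pagemap:{cls}-{prop}:{encoded}")
    return terms


def search_url(terms, api_key, engine_id):
    """Build a request URL; api_key and engine_id are placeholders."""
    params = {"key": api_key, "cx": engine_id, "q": " ".join(terms)}
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    example = {
        "type": "Article",
        "author": "Minsky",
        "name": "A framework for representing knowledge",
    }
    terms = query_by_example(example)
    print("\n".join(terms))
    print(search_url(terms, "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

Keeping the terms as a list makes it easy to drop or add a property before joining them into the final query string.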

Custom range queries and ranking of results

The structured search syntax also allows filtering by property values within ranges, sorting results by property values, and much more.
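
As a sketch of what sorting and range filtering might look like, the helpers below build values for a sort request parameter. This assumes the {class}-{property}:{direction} and {class}-{property}:r:{low}:{high} forms described in Google's documentation on filtering and sorting results; the helper names are hypothetical, and the exact syntax should be checked against that documentation before relying on it.

```python
def sort_value(cls, prop, descending=True):
    """Value for the assumed sort parameter: order results by a property."""
    direction = "d" if descending else "a"
    return f"{cls}-{prop}:{direction}"


def range_value(cls, prop, low="", high=""):
    """Value for the assumed sort parameter: keep results within a range.

    Leaving low or high empty produces an open-ended bound.
    """
    return f"{cls}-{prop}:r:{low}:{high}"


if __name__ == "__main__":
    print(sort_value("review", "rating"))
    print(range_value("review", "rating", 3.0, 5.0))
```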