OpenCalais API


Thomson Reuters provides OpenCalais, a service that extracts all the entities it can find in a chunk of HTML or plain text and returns them either as RDF/XML or as a simple list.

Here's some PHP 5 code that will use the OpenCalais API to extract entities from a chunk of HTML:

<?php
$data = array(
  'licenseID' => 'YOUR_API_KEY', // put your license key here, from: http://www.opencalais.com/user/register
  'content' => '<p>Steve Jobs announced a new iPhone in San Francisco.</p>', // set your HTML content here
  // the c: prefix is needed because of a bug in Calais: it doesn't take a default namespace (http://opencalais.com/node/296)
  'paramsXML' => '<c:params xmlns:c="http://s.opencalais.com/1/pred/">
    <c:processingDirectives c:contentType="text/html" c:outputFormat="Text/Simple"/>
    <c:userDirectives c:allowDistribution="false" c:allowSearch="false"/>
    </c:params>',
  );
// POST the form-encoded parameters; the second argument (use_include_path) is a boolean
$response = file_get_contents(
  'http://api.opencalais.com/enlighten/calais.asmx/Enlighten', false,
  stream_context_create(array('http' => array(
    'method' => 'POST',
    'header' => 'Content-Type: application/x-www-form-urlencoded',
    'content' => http_build_query($data, '', '&'),
  )))
  );
if ($response === false)
  die("Request to OpenCalais failed\n");
// response is two layers of XML, for some reason: the outer element's text is itself an XML document
$text = simplexml_load_string($response);
$xml = simplexml_load_string((string) $text);
// each child element is one entity; the element name is the entity type
foreach ($xml->CalaisSimpleOutputFormat->children() as $term)
  printf("%s: %s\n", $term->getName(), (string) $term);
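
The two-layer response handling can be tried without an API key. The sketch below is only an illustration: the sample response is hand-written, not real API output (its shape is assumed from the parsing code above), and calais_simple_entities() is a hypothetical helper name, not part of any OpenCalais library.

```php
<?php
// Hypothetical, hand-written stand-in for an API response: an outer element
// whose text content is a second, escaped XML document.
$inner = '<OpenCalaisSimple><CalaisSimpleOutputFormat>'
  . '<Person>Steve Jobs</Person>'
  . '<City>San Francisco</City>'
  . '<Product>iPhone</Product>'
  . '</CalaisSimpleOutputFormat></OpenCalaisSimple>';
$sampleResponse = '<string>' . htmlspecialchars($inner) . '</string>';

// Hypothetical helper: unwrap both XML layers and return array(type, term) pairs.
function calais_simple_entities($response) {
  // First layer: the wrapper element whose text is the real document.
  $outer = simplexml_load_string($response);
  if ($outer === false)
    return array();
  // Second layer: parse the wrapper's text content as XML.
  $xml = simplexml_load_string((string) $outer);
  if ($xml === false || !isset($xml->CalaisSimpleOutputFormat))
    return array();
  $entities = array();
  foreach ($xml->CalaisSimpleOutputFormat->children() as $term)
    $entities[] = array($term->getName(), (string) $term);
  return $entities;
}

foreach (calais_simple_entities($sampleResponse) as $entity)
  printf("%s: %s\n", $entity[0], $entity[1]);
```

With the sample data this prints one "Type: term" line per entity, for example "Person: Steve Jobs". Returning an empty array on a parse failure is a design choice for the sketch; real code would probably want to report the error instead.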