Thomson Reuters provides OpenCalais, a web service that extracts the entities it can find in a chunk of HTML or plain text and returns them either as RDF/XML or as a simple list of entities.
Here's some PHP 5 code that will use the OpenCalais API to extract entities from a chunk of HTML:
$data = array(
    'licenseID' => 'YOUR_API_KEY', // put your license key here, from: http://www.opencalais.com/user/register
    'content' => '<p>Steve Jobs announced a new iPhone in San Francisco.</p>', // set your HTML content here
    'paramsXML' => '<c:params xmlns:c="http://s.opencalais.com/1/pred/">
        <c:processingDirectives c:contentType="text/html" c:outputFormat="Text/Simple"/>
        <c:userDirectives c:allowDistribution="false" c:allowSearch="false"/>
        </c:params>', // bug in Calais - doesn't take default namespace http://opencalais.com/node/296
);
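$response = file_get_contents(
    'http://api.opencalais.com/enlighten/calais.asmx/Enlighten', false,
    stream_context_create(array('http' => array(
        'method'  => 'POST',
        // set the content type explicitly so PHP does not have to guess it
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query($data, '', '&'),
    )))
);
if ($response === false)
    die("Request to OpenCalais failed\n");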
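// the response is two layers of XML, for some reason: the text of the outer document is itself an XML document
$text = simplexml_load_string($response);
$xml = simplexml_load_string((string) $text);
if ($xml === false)
    die("Could not parse the OpenCalais response\n");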
// casting to an array gives entity type => a single term, or an array of terms
foreach ((array) $xml->CalaisSimpleOutputFormat as $type => $terms){
    if (!is_array($terms))
        $terms = array($terms);
    foreach ($terms as $term)
        printf("%s: %s\n", $type, $term);
}
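To try the two-layer parsing without an API key or a network call, here is a minimal sketch that runs the same decoding steps on a hand-written stand-in for the response. The sample is an assumption, not captured API output: only the CalaisSimpleOutputFormat element name comes from the code above, while the wrapper element names, the entity entries and the helper name parse_simple_entities are hypothetical.

```php
<?php
// Hand-written stand-in for the inner document. Everything except the
// CalaisSimpleOutputFormat element name is made up for illustration.
$inner = '<OpenCalaisSimple><CalaisSimpleOutputFormat>'
    . '<Person>Steve Jobs</Person>'
    . '<City>San Francisco</City>'
    . '<City>Cupertino</City>'
    . '</CalaisSimpleOutputFormat></OpenCalaisSimple>';

// The outer layer carries the inner document as escaped text.
$response = '<string>' . htmlspecialchars($inner) . '</string>';

// Hypothetical helper: returns array(entity type => array of terms),
// or an empty array if either XML layer cannot be parsed.
function parse_simple_entities($response)
{
    $outer = simplexml_load_string($response);
    if ($outer === false)
        return array();

    // The text content of the outer element is the inner XML document.
    $xml = simplexml_load_string((string) $outer);
    if ($xml === false || !isset($xml->CalaisSimpleOutputFormat))
        return array();

    $entities = array();
    foreach ($xml->CalaisSimpleOutputFormat->children() as $node)
        $entities[$node->getName()][] = (string) $node;

    return $entities;
}

foreach (parse_simple_entities($response) as $type => $terms)
    foreach ($terms as $term)
        printf("%s: %s\n", $type, $term);
```

Iterating over children() instead of casting to an array means every entity type always maps to a list of terms, so the is_array() check is not needed in this variant.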