Scraping web pages with PHP 5


$url = 'http://www.example.com/'; // placeholder: the address of the page to scrape.
$html = new DOMDocument();
@$html->loadHTMLFile($url); // fetch the remote HTML file and parse it (@ suppresses the warnings that invalid markup triggers).
$xml = simplexml_import_dom($html); // convert the DOM object to a SimpleXML object.
foreach ($xml->xpath('//a') as $node){ // run an XPath query and iterate through the array of results.
  print (string) $node . "\n"; // casting to string produces the node's own text (text inside child elements is not included).
  print $node['href'] . "\n"; // attributes of the node are accessible as array elements.
  print $node->asXML() . "\n\n"; // asXML() produces the node's whole markup as a string.
}
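
To try the same approach without a network connection, you can parse an HTML string instead of a URL. This is only a minimal sketch: the helper name extract_links and the sample markup are made up for illustration, and it uses libxml_use_internal_errors() as an alternative to @ for keeping parser warnings quiet.

```php
<?php
// Hypothetical helper (not part of the original post):
// collect the text and href of every link in an HTML string.
function extract_links($source) {
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $html = new DOMDocument();
    $html->loadHTML($source); // like loadHTMLFile(), but parses a string
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xml = simplexml_import_dom($html);
    $links = array();
    foreach ($xml->xpath('//a[@href]') as $node) { // only links that have an href
        $links[] = array(
            'text' => trim((string) $node),
            'href' => (string) $node['href'],
        );
    }
    return $links;
}

// Made-up sample markup.
$sample = '<html><body><p><a href="/one">First</a> and <a href="/two">Second</a></p></body></html>';
foreach (extract_links($sample) as $link) {
    print $link['text'] . ' => ' . $link['href'] . "\n";
}
```

Returning plain arrays rather than SimpleXML objects keeps the results easy to store or serialize once the document itself is no longer needed.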

Note: if namespaces are involved, first register a prefix for the namespace with

$xml->registerXPathNamespace('NAMESPACE_PREFIX', 'NAMESPACE_URI');
and then use that prefix in the query:
$xml->xpath('//NAMESPACE_PREFIX:ELEMENT');
replacing the text in capitals as appropriate.
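
As a rough illustration of the namespace case, here is a sketch that queries a small Atom-style document. The sample XML and the prefix "atom" are assumptions chosen for the example; the prefix is local to the query and does not have to match anything in the document.

```php
<?php
// Made-up sample: every element is in the Atom default namespace.
$feed = '<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>';

$xml = simplexml_load_string($feed);

// An unprefixed query only matches elements in no namespace,
// so it finds nothing in this document.
$unqualified = $xml->xpath('//title');

// Bind an arbitrary prefix to the namespace URI, then use it in the query.
$xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
$titles = array();
foreach ($xml->xpath('//atom:title') as $node) {
    $titles[] = (string) $node;
}

print (empty($unqualified) ? 'no match' : 'match') . " without a prefix\n";
print implode(', ', $titles) . "\n";
```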

Comments

Thanks, this was useful. I was just looking at scraping with cURL, but this seems better...

Posted by: Andrew on November 9, 2007 2:22 PM

I always find:
1. Fetch
2. Run through HTML Tidy
3. Parse with SimpleXML
4. XPath fun

Works a treat, and can fix some... messy... pages.

You could run it through Tidy - I used to - but I find it easier to use PHP's built-in HTML parser instead.

Easy enough with DOMDocument, isn't it? I remember a while ago trying to scrape it all manually. No chance!

For me, SimpleXML is significantly easier to work with than DOMDocument (as long as you don't want to do anything too complicated).
