Libxml2, PHP and UTF-8


This isn't a new problem, but it's not clearly stated anywhere that I can find, so it seems worth noting:

When reading an XML string with SimpleXML, a document with no encoding declaration is apparently not written back out as UTF-8, as I'd assumed it would be. Reading in a string without a declared encoding and writing out individual elements works as expected, as UTF-8, but when the whole document is written out with asXML(), any non-ASCII characters get converted to numeric character references. This looks like libxml2 falling back to ASCII-safe output when the document has no recorded encoding, rather than anything to do with the system's default encoding.

$doc = simplexml_load_string('<text>umlaut ü here</text>');
print $doc->asXML() . "\n";

<text>umlaut &#xFC; here</text>
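
The difference between reading text, writing out a single element and writing out the whole document can be seen with a slightly larger input. This is only a sketch: the <doc> wrapper element is made up for illustration, the file is assumed to be saved as UTF-8, and the exact serialization may vary with the PHP and libxml2 versions in use.

```php
<?php
// Assumes this file is saved as UTF-8 and the SimpleXML extension is loaded.
// The <doc> wrapper is a made-up example so that <text> is a child element.
$doc = simplexml_load_string('<doc><text>umlaut ü here</text></doc>');

// Text content read from the tree comes back as a UTF-8 string.
$text = (string) $doc->text;

// A child element is serialized on its own, without an XML declaration.
$element = $doc->text->asXML();

// The root element is serialized as a whole document, declaration included.
$document = $doc->asXML();

print $text . "\n";
print $element . "\n";
print $document;
```

On the setup described here, only the last of the three should contain &#xFC; in place of the ü.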

To avoid this, explicitly declare the encoding at the start of the XML string:

$doc = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><text>umlaut ü here</text>');
print $doc->asXML() . "\n";

<text>umlaut ü here</text>
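
If the XML strings come from somewhere else, the declaration can be added on the way in. The following is a minimal sketch, not a robust solution: load_xml_as_utf8() is a hypothetical helper name, and it assumes the input really is UTF-8. It does nothing for strings that already carry an XML declaration without an encoding, or that start with a byte order mark.

```php
<?php
// Hypothetical helper: prepend a UTF-8 declaration when the string has none.
// Assumes the input bytes are UTF-8; it does not detect or convert encodings.
function load_xml_as_utf8(string $xml)
{
    $xml = ltrim($xml);

    // "<?xml" followed by whitespace is an XML declaration;
    // this deliberately does not match e.g. "<?xml-stylesheet".
    if (!preg_match('/^<\?xml\s/', $xml)) {
        $xml = '<?xml version="1.0" encoding="UTF-8"?>' . $xml;
    }

    return simplexml_load_string($xml);
}

$doc = load_xml_as_utf8('<text>umlaut ü here</text>');
print $doc->asXML();
```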

There doesn't seem to be anywhere to set the default encoding within PHP, and the libxml2 'encodings' documentation seems to suggest that UTF-8 should be the default.
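
Short of a global default, it does seem possible to set the encoding on a document that has already been parsed, by going through the DOM extension. This sketch assumes the DOM extension is loaded alongside SimpleXML, and relies on dom_import_simplexml() returning a node that shares the same underlying document, so that changing the DOMDocument's encoding property also affects what asXML() writes out.

```php
<?php
// Assumes this file is saved as UTF-8 and that both the SimpleXML and DOM
// extensions are available.
$doc = simplexml_load_string('<text>umlaut ü here</text>');

// Get the DOMDocument behind the SimpleXML element and record an encoding
// on it; no declaration was present in the input, so none was set.
$dom = dom_import_simplexml($doc)->ownerDocument;
$dom->encoding = 'UTF-8';

// Serialize through SimpleXML again, now with an encoding recorded.
$output = $doc->asXML();
print $output;
```

This leaves the input string untouched, at the cost of depending on a second extension.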