PHP, DOM and XML encodings

This is my understanding; I hope someone will correct anything that's wrong.

Loading an existing document

When reading in an existing XML document using DOMDocument::load or DOMDocument::loadXML, the encoding is specified by the XML declaration at the start of the file:

<?xml version="1.0" encoding="UTF-8"?>
If the encoding attribute is missing, the parser assumes UTF-8, unless there's a Byte-Order-Mark (BOM) at the start of the file that indicates otherwise.
The PHP "encoding" and "xmlEncoding" properties of the DOM document are set to the declared value. If nothing was declared, they stay empty (null), as the example at the end shows.
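
As a quick check of the above, here is a minimal sketch that loads two small documents from strings, one with an encoding declaration and one without, and reads the properties back. The variable names and sample content are my own, and the non-ASCII character is written as a byte escape so the result doesn't depend on how the PHP file itself is saved.

```php
<?php
// Sample input: an ISO-8859-1 document; "\xE9" is "é" as a single Latin-1 byte.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\xE9</p>";

$declared = new DOMDocument();
$declared->loadXML($latin1);

// A document without an encoding declaration (ASCII content only).
$undeclared = new DOMDocument();
$undeclared->loadXML('<p>cafe</p>');

// The first document reports the declared name, the second reports NULL.
var_dump($declared->encoding, $declared->xmlEncoding);
var_dump($undeclared->encoding);
```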

Creating a new document

When creating a new DOM document from scratch in PHP, the character encoding used for that document can be defined:

$dom = new DOMDocument('1.0', 'UTF-8');
If an existing XML document is later read into this object, these parameters are replaced by the values from that document.
Otherwise they will be written to the XML declaration when the document is output later on.
If no encoding is given, the default is null, rather than UTF-8.
The PHP "encoding" and "xmlEncoding" properties of the DOM document are set to this value.
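
To make the three cases concrete, here is a small sketch (the variable names are just for illustration) that creates a document with an encoding, one without, and one whose constructor arguments are overridden by a later load.

```php
<?php
// The encoding passed to the constructor is stored on the document.
$withEncoding = new DOMDocument('1.0', 'UTF-8');

// No arguments: the encoding stays unset.
$withoutEncoding = new DOMDocument();

// Constructor arguments are replaced by whatever the loaded document declares.
$reloaded = new DOMDocument('1.0', 'ISO-8859-1');
$reloaded->loadXML('<?xml version="1.0" encoding="UTF-8"?><p/>');

var_dump($withEncoding->encoding);    // string "UTF-8"
var_dump($withoutEncoding->encoding); // NULL
var_dump($reloaded->encoding);        // string "UTF-8"
```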

Output

When writing out an XML document using DOMDocument::save or DOMDocument::saveXML, the "encoding" property of the DOM document decides the output encoding. The property is set if an encoding was specified in the XML declaration of a document that was read, was specified when creating a new DOM document, or was assigned later. If it is set, that encoding will be used for the output document.

If no encoding has been specified earlier, the output doesn't seem to depend on the system/user environment (such as the LANG variable on UNIX), and it isn't Latin-1 either. As far as I can tell, libxml then writes every non-ASCII character as a numeric character reference, so the output is plain ASCII.
If no encoding was specified, no encoding attribute will be written to the output XML declaration.

Any Unicode characters in the XML document that aren't found in the character set of the output encoding will be represented as numeric character references. It doesn't make any difference whether a reference is decimal (&#9832;) or hexadecimal (&#x2668;); both stand for the same character.
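
A small sketch of both points, assuming a UTF-8 input document that is then written out as ISO-8859-1. The sample text and variable names are mine, and the non-ASCII characters are given as byte escapes so the file's own encoding doesn't matter.

```php
<?php
// Input is UTF-8: "\xC3\xA9" is é (U+00E9), "\xCE\xA9" is Ω (U+03A9).
$dom = new DOMDocument();
$dom->loadXML("<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>\xC3\xA9 \xCE\xA9</p>");

// Switch the output encoding to Latin-1 before serialising.
$dom->encoding = 'ISO-8859-1';
$latin1 = $dom->saveXML();

// é exists in Latin-1 and is written as the single byte 0xE9;
// Ω does not, so it comes out as a numeric character reference.
var_dump(strpos($latin1, "\xE9") !== false);
var_dump(preg_match('/&#(937|x3A9);/i', $latin1) === 1);

// On input, decimal and hexadecimal references give the same character.
$dec = new DOMDocument();
$dec->loadXML('<p>&#9832;</p>');
$hex = new DOMDocument();
$hex->loadXML('<p>&#x2668;</p>');
var_dump($dec->documentElement->nodeValue === $hex->documentElement->nodeValue);
```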

When writing out the contents of a DOM node with print $node->nodeValue, for example, the encoding will always be UTF-8, as that's what PHP's internal XML handling uses.
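
To illustrate, this sketch loads a Latin-1 document and looks at the raw bytes of a node value. The sample document is made up for the purpose.

```php
<?php
// Sample input: a Latin-1 document; "\xE9" is é as one ISO-8859-1 byte.
$dom = new DOMDocument();
$dom->loadXML("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\xE9</p>");

$value = $dom->documentElement->nodeValue;

// The document still reports its declared encoding...
var_dump($dom->encoding);

// ...but the string handed to PHP is UTF-8: é is now the two bytes C3 A9.
var_dump(bin2hex($value));
```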

Here's a short example:

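<?php
// Save this file as UTF-8 so that the literal Ω below is valid UTF-8.
// loadXML is called on an instance: the static call fails in PHP 8.

// Print the document's encoding property and its serialised form.
function output($dom){
  print 'encoding: ' . $dom->encoding . "\n";
  print $dom->saveXML() . "\n";
}

// Loaded, with an encoding declaration
$dom = new DOMDocument();
$dom->loadXML('<?xml version="1.0" encoding="UTF-8"?><p>Ω</p>');
output($dom);

// Loaded, without an XML declaration
$dom = new DOMDocument();
$dom->loadXML('<p>Ω</p>');
output($dom);

// Loaded without a declaration, encoding set afterwards
$dom = new DOMDocument();
$dom->loadXML('<p>Ω</p>');
$dom->encoding = "UTF-8";
output($dom);

// Created without an encoding
$dom = new DOMDocument();
$dom->appendChild($dom->createTextNode('Ω'));
output($dom);

// Created as UTF-8
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild($dom->createTextNode('Ω'));
output($dom);

// Created as ISO-8859-1
$dom = new DOMDocument('1.0', 'ISO-8859-1');
$dom->appendChild($dom->createTextNode('Ω'));
output($dom);

// Created as ISO-8859-1, then overridden by the loaded document
$dom = new DOMDocument('1.0', 'ISO-8859-1');
$dom->loadXML('<?xml version="1.0" encoding="UTF-8"?><p>Ω</p>');
output($dom);

which produces this:
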
encoding: UTF-8
<?xml version="1.0" encoding="UTF-8"?>
<p>Ω</p>
encoding:
<?xml version="1.0"?>
<p>&#x3A9;</p>
encoding: UTF-8
<?xml version="1.0" encoding="UTF-8"?>
<p>Ω</p>
encoding:
<?xml version="1.0"?>
&#x3A9;
encoding: UTF-8
<?xml version="1.0" encoding="UTF-8"?>
Ω
encoding: ISO-8859-1
<?xml version="1.0" encoding="ISO-8859-1"?>
&#937;
encoding: UTF-8
<?xml version="1.0" encoding="UTF-8"?>
<p>Ω</p>