PHP, DOM, DTDs and named entities

Input

When an XML document is loaded with DOMDocument::load or DOMDocument::loadXML, several libxml options affect how it is processed. These are some of the most useful:

Option           Description
LIBXML_DTDLOAD   Load the DTD for this XML file, as specified in the DOCTYPE declaration and possibly located via /etc/xml/catalog
LIBXML_NOENT     Substitute entity references with the replacement text defined for them in the DTD
LIBXML_NOCDATA   Convert CDATA blocks into text nodes
LIBXML_DTDATTR   Add default attributes specified in the DTD if they're missing from XML elements
LIBXML_DTDVALID  Validate the XML document against the DTD
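
As a minimal sketch of what LIBXML_NOENT and LIBXML_NOCDATA change, the following loads the same document with and without them. The document is a made-up sample (the note element is hypothetical, not part of the example further down), and its DTD is an internal subset so that nothing has to be read from disk.

```php
<?php
// Hypothetical sample document with the DTD as an internal subset.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY omegachar "&#937;">
]>
<note>&omegachar; <![CDATA[<test>]]></note>
XML;

// No options: the entity reference and the CDATA section are kept as they are.
$plain = new DOMDocument();
$plain->loadXML($xml);
$plainXml = $plain->saveXML($plain->documentElement);

// LIBXML_NOENT substitutes the entity; LIBXML_NOCDATA turns CDATA into text.
$converted = new DOMDocument();
$converted->loadXML($xml, LIBXML_NOENT | LIBXML_NOCDATA);
$convertedXml = $converted->saveXML($converted->documentElement);

print $plainXml . "\n";
print $convertedXml . "\n";
```

The first line printed should still contain the &omegachar; reference and the CDATA section, while the second should contain the substituted character and an escaped &lt;test&gt;.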

There are also related PHP DOMDocument properties that can be set, but it's best to use the libxml options above to have exact control over what happens:

Property                  Equivalent
$dom->resolveExternals    LIBXML_DTDLOAD | LIBXML_DTDATTR
$dom->substituteEntities  LIBXML_NOENT
$dom->validateOnParse     LIBXML_DTDLOAD | LIBXML_DTDVALID
$dom->preserveWhiteSpace  None while TRUE (the default, which keeps redundant white space); setting it to FALSE corresponds to LIBXML_NOBLANKS
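
To illustrate one of these equivalences, here is a small sketch that loads a document once with LIBXML_NOENT and once with substituteEntities, then compares the serialized results. The one-line sample document and the variable names are assumptions made for the illustration.

```php
<?php
// Hypothetical one-line document that declares and uses a named entity.
$xml = '<!DOCTYPE note [<!ENTITY omegachar "&#937;">]><note>&omegachar;</note>';

// Entity substitution requested through a libxml option.
$viaOption = new DOMDocument();
$viaOption->loadXML($xml, LIBXML_NOENT);

// Entity substitution requested through the property, set before loading.
$viaProperty = new DOMDocument();
$viaProperty->substituteEntities = true;
$viaProperty->loadXML($xml);

$sameResult = $viaOption->saveXML() === $viaProperty->saveXML();
print ($sameResult ? 'same output' : 'different output') . "\n";
```

The property only takes effect if it is set before the document is loaded, which is one reason passing the option directly to load/loadXML is easier to follow.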

The method DOMDocument::validate() can be used instead of setting LIBXML_DTDVALID or $dom->validateOnParse, to validate the document against its DTD after it has been parsed.
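
A sketch of validating after parsing, assuming a small sample document with an internal DTD subset (the sample and the deliberately invalid span element are made up for the illustration). libxml_use_internal_errors() is used here so that validation problems can be inspected rather than raised as PHP warnings.

```php
<?php
// Hypothetical document that is valid against its own internal DTD.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body [
  <!ELEMENT body (div+)>
  <!ELEMENT div (p*)>
  <!ELEMENT p (#PCDATA)>
]>
<body><div><p>text</p></div></body>
XML;

// Collect libxml errors instead of emitting warnings.
libxml_use_internal_errors(true);

// Parse first, validate afterwards.
$dom = new DOMDocument();
$dom->loadXML($xml);
$isValid = $dom->validate();

// Swap in an element the DTD does not declare, then validate again.
$broken = new DOMDocument();
$broken->loadXML(str_replace('<p>text</p>', '<span>text</span>', $xml));
$brokenIsValid = $broken->validate();

// Read the collected validation errors, then reset the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($isValid, $brokenIsValid);
foreach ($errors as $error) {
    print trim($error->message) . "\n";
}
```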

Output

There's one libxml option that can be used with DOMDocument::save/DOMDocument::saveXML to affect the output XML:

Option             Description
LIBXML_NOEMPTYTAG  Expand self-closing empty tags
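
A short sketch of the difference, using a made-up fragment and serializing only the root element so that the XML declaration stays out of the way:

```php
<?php
// Hypothetical fragment containing one empty element.
$dom = new DOMDocument();
$dom->loadXML('<body><div><br/></div></body>');

// Default serialization keeps the self-closing form.
$compact = $dom->saveXML($dom->documentElement);

// LIBXML_NOEMPTYTAG writes an explicit start tag and end tag instead.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

print $compact . "\n";
print $expanded . "\n";
```

Because the options are the second argument of saveXML, a node has to be passed as the first argument whenever an option is used.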

There is also one DOMDocument property that can be set before using DOMDocument::save/DOMDocument::saveXML to output XML:

Property            Description
$dom->formatOutput  Indent and format the output

Note that if the document contains white space between elements, formatOutput has no effect on the output unless preserveWhiteSpace is set to FALSE before the document is loaded.
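
The interaction between the two properties can be sketched as follows, assuming a small made-up document that already has line breaks between some of its elements:

```php
<?php
// Hypothetical document with white space between some elements.
$xml = "<body>\n<div><p>one</p>\n<p>two</p></div>\n</body>";

// White space kept while loading: formatOutput cannot re-indent the tree.
$kept = new DOMDocument();
$kept->loadXML($xml);
$kept->formatOutput = true;
$keptXml = $kept->saveXML();

// White space dropped while loading: formatOutput indents the output.
$stripped = new DOMDocument();
$stripped->preserveWhiteSpace = false; // must be set before loading
$stripped->loadXML($xml);
$stripped->formatOutput = true;
$strippedXml = $stripped->saveXML();

print $keptXml . "\n";
print $strippedXml . "\n";
```

In the first result the original layout should come back unchanged; in the second each nested element should appear on its own indented line.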

Example code

First, a DTD file, saved as example.dtd:

<!-- define the entity "omegachar" -->
<!ENTITY omegachar "Ω">
<!-- set a default "title" attribute for "div" elements -->
<!ATTLIST div 
  title CDATA "default title">
<!-- define allowable elements and their contents -->
<!ELEMENT body (div+)>
<!ELEMENT div (p*, br*)>
<!ELEMENT p (#PCDATA)>
<!ELEMENT br (#PCDATA)>

Then some example XML that references the DTD:

$xml = '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "example.dtd">
<body><div><p>Ω</p>
<p>&omegachar;</p><p><![CDATA[<test>]]></p><br/></div></body>';

And some code to create and output DOM documents:

<?php
// $xml holds the example document defined above

// libxml options used for the first two loads
$options = LIBXML_DTDLOAD | LIBXML_NOENT | LIBXML_DTDVALID | LIBXML_NOCDATA;

// 1. load using libxml options only
$dom = new DOMDocument();
$dom->loadXML($xml, $options);
print 'No default attributes and unformatted output; named entity converted:' . "\n";
print $dom->saveXML($dom) . "\n";

// 2. same options, but drop redundant white space so that formatOutput can indent
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadXML($xml, $options);
$dom->formatOutput = TRUE;
print 'Formatted output and no empty tags:' . "\n";
print $dom->saveXML($dom, LIBXML_NOEMPTYTAG) . "\n";

// 3. load with DOMDocument properties instead of libxml options
$dom = new DOMDocument();
$dom->resolveExternals = TRUE;
$dom->substituteEntities = TRUE;
$dom->loadXML($xml);
print 'Default attributes added due to resolveExternals; CDATA nodes unchanged:' . "\n";
print $dom->saveXML($dom) . "\n";

which produces this:

No default attributes and unformatted output; named entity converted:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "example.dtd">
<body><div><p>Ω</p>
<p>Ω</p><p>&lt;test&gt;</p><br/></div></body>
Formatted output and no empty tags:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "example.dtd">
<body>
  <div>
    <p>Ω</p>
    <p>Ω</p>
    <p>&lt;test&gt;</p>
    <br></br>
  </div>
</body>
Default attributes added due to resolveExternals; CDATA nodes unchanged:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "example.dtd">
<body><div title="default title"><p>Ω</p>
<p>Ω</p><p><![CDATA[<test>]]></p><br/></div></body>