Mapping XML Named Character References to Unicode Characters

Characters & Code Points

Character sets provide mappings between numeric code points and the semantics of characters at each code point.

ISO 10646 (1990): Universal Character Set/UCS.
Unicode 1.0 (1991).
Unicode 2.0 (1996).
Unicode 5.2 (2009).
Unicode 2.0 (1996) exactly matches the characters/code points defined in ISO 10646-1 (UCS, 1993), and since then their development has stayed aligned.

Mapping named entity references to characters

Markup languages (e.g. DocBook, TEI, MathML, (X)HTML) define mappings between named character entity references (e.g. &pi;) and Unicode code points.

Named character reference	Numeric character reference (hex)	Numeric character reference (dec)	Unicode character
&pi;	&#x3C0;	&#960;	π
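The equivalence of the three reference forms in the table can be checked with a short Python sketch. It assumes Python's built-in `html.unescape`, which resolves names using the HTML5 table rather than any particular DTD; `numeric_forms` is a hypothetical helper named here for illustration only.

```python
import html


def numeric_forms(char: str) -> tuple[str, str]:
    """Return the hex and decimal numeric character references for one character."""
    code_point = ord(char)
    return f"&#x{code_point:X};", f"&#{code_point};"


# Assumption: html.unescape resolves names with Python's built-in HTML5 table.
pi = html.unescape("&pi;")
hex_ref, dec_ref = numeric_forms(pi)
print(hex_ref, dec_ref)

# The named, hex, and decimal forms all decode to the same character.
assert html.unescape("&pi;") == html.unescape(hex_ref) == html.unescape(dec_ref)
```

The numeric forms carry the code point itself, so they need no declaration; only the named form depends on a mapping being available.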

Entity sets

ISO SGML standards (1986 - 1991)
- ISO 9573-13 (1991): SGML public entity sets for mathematics and science.
These sets define standard lists of named characters, but don't provide mappings to Unicode code points. The original SGML entity sets were defined via SDATA entities, and allowed character entity names to be defined without mapping the character to any particular encoding: the processing of the entities was specified in a system-specific manner for any system processing the SGML. XML does not support SDATA entities, so for XML it is necessary to map the entity names to Unicode characters^[source].
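As a rough illustration of why XML needs an explicit mapping, the sketch below declares `pi` in an internal DTD subset and parses the document with Python's `xml.etree.ElementTree`. The sample documents and the `parse_text` helper are hypothetical; a real document would normally pull the declarations in from an external entity file instead.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset that maps the entity name to a Unicode code point.
DECLARED = """<!DOCTYPE p [
  <!ENTITY pi "&#x3C0;">
]>
<p>2&pi;r</p>"""

# The same reference with no declaration in scope.
UNDECLARED = "<p>2&pi;r</p>"


def parse_text(document: str) -> str:
    """Parse an XML string and return the text of its root element."""
    return ET.fromstring(document).text


print(parse_text(DECLARED))

try:
    parse_text(UNDECLARED)
except ET.ParseError as error:
    # Without a declaration the parser rejects the named reference.
    print("rejected without a declaration:", error)
```

The second case shows that an XML parser has no built-in knowledge of names such as `pi`: the name only means something once a declaration supplies the character.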
HTML4 (1997) and XHTML 1.0 (2000)

The HTML4 and XHTML 1.0 DTDs define around 250 named characters, mostly derived from the ISO standards above, and map these named characters to Unicode code points: xhtml-lat1, xhtml-symbol and xhtml-special.
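Python's standard library ships a copy of the HTML4 mappings as `html.entities.name2codepoint`, which gives a quick way to inspect them. In this sketch the `EXAMPLES` table and the `describe` helper are illustrative assumptions: each entry picks one name from one of the three files listed above.

```python
from html.entities import name2codepoint

# One example name from each of the three entity files (illustrative choice).
EXAMPLES = {
    "nbsp": "xhtml-lat1",
    "pi": "xhtml-symbol",
    "euro": "xhtml-special",
}


def describe(name: str) -> str:
    """Format an HTML4 entity name with its code point and source file."""
    code_point = name2codepoint[name]
    return f"&{name}; -> U+{code_point:04X} ({EXAMPLES.get(name, 'unknown file')})"


for name in EXAMPLES:
    print(describe(name))
```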

MathML2 (2001)

Mathematical characters were added to Unicode 3.2 (2002) and 4.0 (2003), so MathML2 character references could then be mapped to Unicode. There is a section of the MathML2 specification that deals with Characters.

```
# generate HTML, ISO8879, ISO9573-13 and MathML entity files (they're written to ../DTD/mathml2) for use with the MathML2 DTD:
xsltproc http://www.w3.org/Math/characters/entities.xsl http://www.w3.org/Math/characters/unicode.xml
```

W3C XML

The W3C has produced a standardised set of mappings between named character references and Unicode code points.
w3centities-f.ent contains all the characters from the ISO standards, deduplicated. There are also XSLT2 stylesheets for reverse mapping (converting Unicode characters to named character references).
The source files are available for this standard (see all files).
```
# generate all the entity files (they're written to ../2007), using the open-source version of Saxon (it needs an XSLT2 processor):
java net.sf.saxon.Transform -s:http://www.w3.org/2003/entities/2007xml/unicode.xml -xsl:http://www.w3.org/2003/entities/2007xml/entities.xsl
```
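The idea of reverse mapping (Unicode characters back to named character references) can be sketched in Python. This is not the W3C XSLT2 stylesheet: it is a stand-in that assumes the smaller HTML4 table in `html.entities.codepoint2name`, and `to_named_references` is a hypothetical helper.

```python
from html.entities import codepoint2name


def to_named_references(text: str) -> str:
    """Replace non-ASCII characters with named references where a name exists."""
    parts = []
    for char in text:
        code_point = ord(char)
        if code_point < 128:
            # ASCII is left alone; markup characters such as & and < are NOT escaped here.
            parts.append(char)
        elif code_point in codepoint2name:
            parts.append(f"&{codepoint2name[code_point]};")
        else:
            # No HTML4 name is known: fall back to a hex numeric reference.
            parts.append(f"&#x{code_point:X};")
    return "".join(parts)


print(to_named_references("2πr"))
```

Falling back to a numeric reference keeps the output valid even for characters that the chosen entity set does not name.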
HTML5

HTML5 uses the W3C mappings, above. A table of mappings between character reference names and Unicode code points is available for HTML5.
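Python's standard library includes a copy of the HTML5 table as `html.entities.html5`, so a lookup can be sketched as follows. The `code_points` helper is hypothetical. Note that the table maps names to Unicode code points; encoding the result as UTF-8 bytes is a separate step.

```python
from html.entities import html5


def code_points(name: str) -> list[str]:
    """Return the code points for an HTML5 reference name, in U+XXXX notation."""
    return [f"U+{ord(char):04X}" for char in html5[name]]


# Most names map to a single code point.
print(code_points("pi;"))

# Some names map to a sequence of more than one code point.
print(code_points("NotEqualTilde;"))

# The UTF-8 encoding of the character is distinct from its code point.
print(html5["pi;"].encode("utf-8"))
```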
MathML3

MathML3 uses the W3C mappings, above.
DocBook5

DocBook 5 recommends using the W3C mappings, above.

The W3C's "Entity Definitions for Characters" specification is in Last Call Working Draft status, and the deadline for reviews was last week. Hopefully it will become an official recommendation soon.
