Tagged PDF

A PDF contains a collection of objects, some of which contain instructions for drawing things on a page.

A screen reader can see the pieces of text drawn on each page, but it doesn’t know which order they should be read in, or which pieces are headings, list items, table cells, emphasis, links, code, etc.

PDF accessibility standards

To address this, the PDF standard specifies a Structure Tree: a tree of elements, each with a role (Sect, P, L, Code, etc.), with pointers to the pieces of page content ("marked content") that belong to each element.
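
As a rough illustration of that idea, here is a minimal sketch of a structure tree as a TypeScript data type. The names (StructElem, ContentRef, readingOrder) are hypothetical and the model is heavily simplified: in a real PDF these are dictionaries and references inside the file, not plain objects. The point is only that walking the tree depth-first gives the logical reading order, which can differ from the order in which the content was drawn.

```typescript
// Hypothetical, simplified model; real structure elements are PDF dictionaries.
type ContentRef = { kind: "content"; page: number; mcid: number };
type StructElem = {
  kind: "elem";
  role: string; // e.g. "Sect", "P", "L", "Code"
  kids: Array<StructElem | ContentRef>;
};

// Depth-first traversal collects marked content in logical reading order.
function readingOrder(node: StructElem | ContentRef): ContentRef[] {
  if (node.kind === "content") return [node];
  let refs: ContentRef[] = [];
  for (const kid of node.kids) {
    refs = refs.concat(readingOrder(kid));
  }
  return refs;
}

// Illustrative data: the heading was drawn last (mcid 2) but is read first.
const section: StructElem = {
  kind: "elem",
  role: "Sect",
  kids: [
    { kind: "elem", role: "H1", kids: [{ kind: "content", page: 0, mcid: 2 }] },
    {
      kind: "elem",
      role: "P",
      kids: [
        { kind: "content", page: 0, mcid: 0 },
        { kind: "content", page: 0, mcid: 1 },
      ],
    },
  ],
};

console.log(readingOrder(section).map((ref) => ref.mcid)); // [ 2, 0, 1 ]
```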

The PDF standard also specifies a RoleMap, which maps custom roles to standard roles, applied recursively.
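
A minimal sketch of that recursive lookup, under some assumptions: the list of standard roles here is a small illustrative subset, the function name resolveRole is hypothetical, and the rule of stopping as soon as a standard role is reached is a simplification. The cycle guard is a defensive addition for malformed files.

```typescript
// Assumed subset of the standard roles, for illustration only.
const STANDARD_ROLES: ReadonlyArray<string> = [
  "Document", "Sect", "P", "H1", "L", "LI", "Code", "Span",
];

// Follow the role map until a standard role, an unmapped role, or a cycle.
function resolveRole(role: string, roleMap: Record<string, string>): string {
  const visited: string[] = [];
  let current = role;
  while (
    STANDARD_ROLES.indexOf(current) === -1 &&
    Object.prototype.hasOwnProperty.call(roleMap, current) &&
    visited.indexOf(current) === -1
  ) {
    visited.push(current);
    current = roleMap[current];
  }
  return current;
}

// Hypothetical custom roles: Subtitle -> Tagline -> P.
const customRoles: Record<string, string> = { Subtitle: "Tagline", Tagline: "P" };
console.log(resolveRole("Subtitle", customRoles)); // "P"
```

A role with no mapping is returned unchanged, so the caller can decide how to treat it.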

There is also a ClassMap, which maps class names to attributes (e.g. styles) that should be applied to elements.
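
A minimal sketch of how class attributes might be combined with an element's own attributes. This is a simplification with hypothetical names (Attributes, effectiveAttributes): real attribute objects are grouped by owner, and the precedence used here (later classes override earlier ones, and the element's own attributes override any class) is an assumption made for the example.

```typescript
type Attributes = Record<string, string | number>;

// Merge attributes from each class in order, then apply the element's own.
function effectiveAttributes(
  classNames: string[],
  ownAttributes: Attributes,
  classMap: Record<string, Attributes>
): Attributes {
  const merged: Attributes = {};
  for (const className of classNames) {
    const fromClass = classMap[className] || {}; // unknown classes add nothing
    for (const key of Object.keys(fromClass)) {
      merged[key] = fromClass[key];
    }
  }
  for (const key of Object.keys(ownAttributes)) {
    merged[key] = ownAttributes[key]; // assumed: own attributes win
  }
  return merged;
}

// Illustrative class map with one class.
const classMap: Record<string, Attributes> = {
  Note: { TextAlign: "Center", SpaceBefore: 12 },
};
console.log(effectiveAttributes(["Note"], { TextAlign: "Start" }, classMap));
// { TextAlign: "Start", SpaceBefore: 12 }
```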

The PDF 2.0 standard, together with PDF/UA-2 ("Universal Accessibility"), which builds on it (as does PDF/A-4, the "Archiving" standard), extends the PDF 1.7 and PDF/UA-1 standards by adding more roles and providing a way to associate MathML with each equation in the PDF.

Deriving HTML from PDF

Given all that, it’s possible to derive semantic HTML from a well-tagged PDF, and there is a defined algorithm with a reference implementation for doing that.
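
To give a flavour of the idea, here is a toy sketch that turns a small tree of roles into HTML. It is not the defined algorithm or its reference implementation: the TaggedNode type, the role-to-tag table and the fallback to div are all assumptions made for the example, and text is stored directly in the tree instead of being extracted from marked content.

```typescript
// Hypothetical, simplified tree: leaves are already-extracted text.
type TaggedNode = { role: string; kids: Array<TaggedNode | string> };

// Assumed role-to-tag table covering only a handful of roles.
const HTML_TAGS: Record<string, string> = {
  Document: "article",
  Sect: "section",
  P: "p",
  H1: "h1",
  L: "ul",
  LI: "li",
  Code: "code",
};

function escapeText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Recursively wrap each element's children in the tag chosen for its role.
function toHtml(node: TaggedNode | string): string {
  if (typeof node === "string") return escapeText(node);
  const known = Object.prototype.hasOwnProperty.call(HTML_TAGS, node.role);
  const tag = known ? HTML_TAGS[node.role] : "div"; // assumed fallback
  const inner = node.kids.map((kid) => toHtml(kid)).join("");
  return `<${tag}>${inner}</${tag}>`;
}

const sample: TaggedNode = {
  role: "Sect",
  kids: [
    { role: "H1", kids: ["Tagged PDF"] },
    { role: "P", kids: ["Use ", { role: "Code", kids: ["a < b"] }] },
  ],
};

console.log(toHtml(sample));
// <section><h1>Tagged PDF</h1><p>Use <code>a &lt; b</code></p></section>
```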

I've made a tagged-pdf-to-html TypeScript library with the aim of applying this algorithm to tagged PDFs, and a demo which presents the derived HTML next to the source PDF.

Dual Lab produces several related products, written in Java, including ngPDF, which implements the HTML derivation algorithm.

Creating tagged PDF

Given the number of documents published as PDF, and the regulations that now require institutional publications to be accessible, it makes sense for the tools that create PDFs to make use of the structure already defined in the original markup language (e.g. LaTeX or Typst).

The LaTeX Tagged PDF Project has spent several years adding support for tagging to LaTeX engines (particularly LuaLaTeX), packages and classes. This work is getting close to being complete, though it still lives in experimental packages until it’s ready to be merged into the main LaTeX kernel.

Typst already creates tagged PDFs by default (with support for PDF/UA-2 planned for the future), and encourages the use of semantic markup.