Extracting Text From A PDF Using Only JavaScript


An HTML page like the one below, which embeds a PDF-to-text extraction service I built using pdf.js, can extract the text from a PDF using only client-side JavaScript:


<!-- edit this; the PDF file must be on the same domain as this page -->
<iframe id="input" src="your-file.pdf"></iframe>

<!-- embed the pdftotext service as an iframe -->
<iframe id="processor" src="http://hubgit.github.com/2011/11/pdftotext/"></iframe>

<!-- a container for the output -->
<div id="output"></div>

<script>
var input = document.getElementById("input");
var processor = document.getElementById("processor");
var output = document.getElementById("output");

// listen for messages from the processor
window.addEventListener("message", function (event) {
  // ignore messages that did not come from the processor iframe
  if (event.source !== processor.contentWindow) return;

  switch (event.data) {
    // "ready" = the processor is ready, so fetch the PDF file
    case "ready":
      var xhr = new XMLHttpRequest();
      xhr.open("GET", input.getAttribute("src"), true);
      xhr.responseType = "arraybuffer";
      xhr.onload = function () {
        // "*" skips the target-origin check; use the processor's origin if you know it
        processor.contentWindow.postMessage(xhr.response, "*");
      };
      xhr.send();
      break;

    // anything else = the processor has returned the text of the PDF
    default:
      // guard against non-string messages, which have no replace() method
      if (typeof event.data !== "string") break;
      output.textContent = event.data.replace(/\s+/g, " ");
      break;
  }
}, false);
</script>

See an example running as a live demonstration.

It'll only work in recent browsers, as it requires sending binary data between windows as an ArrayBuffer using window.postMessage, and pdf.js itself relies on Web Workers.
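
As a rough illustration of those requirements, a page could check for the relevant globals before loading the PDF. This is only a sketch: `missingFeatures` is a hypothetical helper, not part of pdf.js or the pdftotext service, and the presence of `postMessage` does not prove that a browser can pass an ArrayBuffer through it (older implementations accepted strings only).

```javascript
// Hypothetical helper, not part of pdf.js or the pdftotext service.
// `scope` is the global object to inspect (`window` in a browser).
// Returns the names of the required features that are missing.
function missingFeatures(scope) {
  var required = ["ArrayBuffer", "XMLHttpRequest", "Worker", "postMessage"];
  return required.filter(function (name) {
    return typeof scope[name] === "undefined";
  });
}

// In a browser, something like:
//   var missing = missingFeatures(window);
//   if (missing.length) output.textContent = "Unsupported: " + missing.join(", ");
```

An empty array means every listed global exists; anything else names what the browser lacks.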

Basically, this fetches the PDF as an ArrayBuffer using XMLHttpRequest, then posts it to the embedded window, which uses pdf.js to render the PDF to Canvas (invisibly; you can see the rendered images if you poke around a bit with a web inspector). As it does so, an HTML layer is constructed, containing a block to match each row of text in the PDF. This layer would normally be overlaid on top of the rendered images to allow text to be selected, a technique used by many services that allow PDF text selection and highlighting, including Crocodoc and Google Docs' PDF viewer. By taking the text content of those blocks, the service can return the contents of the PDF as a single block of text.
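
The last step, collecting the text of each block, might look something like the sketch below. It is not the service's actual code: `blocksToText` is a hypothetical name, and the selector in the usage comment is a placeholder for however the text-layer blocks are really marked up.

```javascript
// Hypothetical sketch: join the text of every text-layer block into one string.
// `blocks` is any array-like collection of objects with a `textContent`
// property, such as the NodeList returned by querySelectorAll.
function blocksToText(blocks) {
  return Array.from(blocks, function (block) {
    return block.textContent;
  })
    .join(" ")
    .replace(/\s+/g, " ") // collapse runs of whitespace, as the page above does
    .trim();
}

// In a browser, with a placeholder selector:
//   var text = blocksToText(document.querySelectorAll(".text-layer > div"));
```

Taking an array-like argument rather than querying the DOM inside the function keeps the joining logic easy to try out without a rendered PDF.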

I expect that pdf.js will acquire a native function for retrieving the text content directly, to make documents searchable. It would be nice, next, to try to recreate paragraphs by looking at the spacing between the blocks, and to use the formatting and other heuristics to extract metadata like title, authors, etc.
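
A first attempt at the paragraph idea could compare the vertical distance between consecutive rows with the row height. The sketch below assumes each block has already been reduced to a `{ text, top, height }` object for a single page, sorted top to bottom; those field names, the function name and the 1.5 threshold are all assumptions for illustration, not anything pdf.js provides.

```javascript
// Hypothetical sketch: group rows into paragraphs by vertical spacing.
// `blocks` is an array of { text, top, height } objects for one page,
// sorted top to bottom. The field names are assumptions.
function groupIntoParagraphs(blocks, gapFactor) {
  var factor = gapFactor || 1.5; // assumed threshold; tune per document
  var paragraphs = [];
  var current = [];
  var previous = null;

  blocks.forEach(function (block) {
    // start a new paragraph when the distance between row tops is
    // noticeably larger than the previous row's height
    if (previous && block.top - previous.top > previous.height * factor) {
      paragraphs.push(current.join(" "));
      current = [];
    }
    current.push(block.text.trim());
    previous = block;
  });

  if (current.length) paragraphs.push(current.join(" "));
  return paragraphs;
}
```

Rows spaced at a regular line height stay together, while a larger jump starts a new paragraph. Multi-column layouts and page breaks would need extra handling.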