pdf text whitespace pdf-viewer pdf-specification

How does Adobe Acrobat break words in PDF documents when copying text?


PDF documents don't require space characters to be present in page content streams to visually separate words. As a consequence, the font program may also lack a glyph for the space character. PDF-compliant viewers appear to use font metrics and the text state to infer an appropriate word spacing width, then check it against character positions to add the missing spaces when text is selected or copied. Unfortunately, the PDF specification does not seem to say how the word spacing width should be computed in such cases. While pdf.js appears to hard-code a size for detecting word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could that heuristic be?


Solution

  • The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing PDF text extraction with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring an arbitrary spacing width and comparing it against glyph positions, the heuristic I'm currently testing is the following:

    unscaledSpacingWidth = (average of non-zero glyph widths obtained from the /W or /Widths arrays) / 7

    Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough on the limited set of samples I tested. By comparison, pdf.js simply picks a hard-coded value of 0.1 PDF points.

    The resulting spacing width is then scaled according to the font size and the rest of the text state.
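
A minimal Python sketch of the heuristic above, for illustration only. The function names, the `(char, x_start, x_end)` glyph tuples and the single-line, left-to-right layout are all hypothetical. The scaling step assumes widths are expressed in 1/1000 of a text space unit (true for most fonts, but not Type 3) and applies only font size and horizontal scaling; a real extractor would also have to account for the text matrix and CTM. This is not a description of what Acrobat actually does.

```python
from statistics import mean

# Arbitrary constant taken from the answer; not documented by Adobe.
SPACING_DIVISOR = 7


def unscaled_spacing_width(widths):
    """Average of the non-zero glyph widths (from /W or /Widths), divided by 7."""
    non_zero = [w for w in widths if w != 0]
    if not non_zero:
        return 0.0
    return mean(non_zero) / SPACING_DIVISOR


def scaled_spacing_width(widths, font_size, horizontal_scaling=100.0):
    """Scale the inferred width by font size and horizontal scaling (in percent).

    Assumption: widths are in 1/1000 of a text space unit.
    """
    base = unscaled_spacing_width(widths)
    return base / 1000.0 * font_size * (horizontal_scaling / 100.0)


def insert_spaces(glyphs, threshold):
    """Rebuild text from (char, x_start, x_end) tuples on one line.

    A space is inserted whenever the gap between the end of the previous
    glyph and the start of the current one is at least `threshold`.
    """
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end >= threshold:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)


if __name__ == "__main__":
    widths = [500, 0, 600, 700]  # hypothetical /Widths entries
    threshold = scaled_spacing_width(widths, font_size=12)
    glyphs = [("H", 0, 6), ("i", 6, 9), ("y", 11, 17), ("o", 17, 23)]
    print(insert_spaces(glyphs, threshold))
```

With these made-up numbers the threshold is a little over one unit, so only the 2-unit gap between "i" and "y" is treated as a word break. Changing `SPACING_DIVISOR` is the obvious knob to tune when comparing against Acrobat's output on real samples.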