Search code examples
pdftexttextboxselection

What governs the text selection order of PDFs, how can it be improved when generating PDFs?


A number of PDFs, particularly those exported by presentation software, desktop publishing or latex typesetting seem to have an illogical text selection marquee order.

For example selecting parts of a math equation in one of my documents seems to randomly select another large block of equations elsewhere on the page, even though they are separated by body text. Is this a problem in the PDF viewer(mac preview) or in the PDF file itself. What procedures should be followed when programmatically generating PDFs to insure a logical ordering for textual selection.


Solution

  • Text selection in PDF viewers is determined by an algorithm in the viewer. Different viewers will have different algorithms and yield different results. Some viewers will leverage the structure tags if they are present, others will ignore the tags even when present.

    Unfortunately, there is nothing you can do as the PDF author to influence how any particular viewer software interprets the text rendering instructions into words then into blocks of text into page regions and finally into a text selection.