
How to extract formatted text correctly in PDFBox, like "Copy With Formatting" in Adobe Acrobat X Pro


I need to extract text correctly from a PDF file (the first line on page 1 of https://github.com/zhongguogu/PDFBOX/blob/master/pdf/formatted_text.pdf) with PDFBox.

The actual result is incorrect, but I can copy the text correctly using "Copy With Formatting" in Adobe Acrobat X Pro: 江苏利士德化工有限公司.

Is there any method to extract formatted text correctly?


Solution

  • You can consider PDFBox's built-in text extraction capabilities to be akin to regular copy and paste from Adobe Acrobat Reader. There are some differences in detail, e.g. Adobe Reader prefers ActualText tags over regular text extraction of the tagged content, but mostly the two behave the same and implement text extraction as described in the PDF specification (ISO 32000-1 / ISO 32000-2).

    Adobe Reader copy&paste from your document results in something which to me looks like a parade of squares. This is approximately what you can expect from PDFBox, too.

    Looking into the internals of your sample PDF one finds no information on which Unicode code point corresponds to any given glyph, at least none encoded in a standard way.

    I assume that Adobe's "Copy With Formatting" extracts text either based on heuristics resulting from prior in-depth analysis of proprietary outputs by common PDF creators, or based on a comparison of glyph definitions with glyphs present in any accessible fonts. It is probably even a combination of both, backed by OCR if all else fails.

    You can implement something similar using PDFBox as a framework to retrieve the raw data, but don't expect this task to be easy.
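
To make the last step of such a custom approach a bit more concrete, here is a minimal sketch in plain Java of only the final translation stage. It assumes you have already obtained the raw character codes of a text run (in PDFBox 2.x, as far as I know, `TextPosition.getCharacterCodes()` inside an overridden `PDFTextStripper.writeString` is one way to get at them) and that you have somehow built a table from glyph code to Unicode, e.g. by glyph comparison or OCR. The class `GlyphRemapper`, its `remap` method, and the table contents are hypothetical illustrations, not part of PDFBox, and building the table is the actual hard part that this sketch does not cover.

```java
import java.util.Map;

// Hypothetical helper: translates raw glyph codes into text using a
// hand-built lookup table. Neither this class nor the table comes from PDFBox.
final class GlyphRemapper {
    // Placeholder (the Unicode replacement character) for codes
    // that have no known Unicode equivalent in the table.
    static final String UNKNOWN = "\uFFFD";

    // Looks up each glyph code in the table and concatenates the results;
    // codes missing from the table are replaced by UNKNOWN.
    static String remap(int[] glyphCodes, Map<Integer, String> codeToUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : glyphCodes) {
            text.append(codeToUnicode.getOrDefault(code, UNKNOWN));
        }
        return text.toString();
    }
}
```

With a made-up table that maps code 1 to 江 and code 2 to 苏, calling `GlyphRemapper.remap(new int[] {1, 2, 99}, table)` yields those two characters followed by the replacement character for the unmapped code 99. Keeping unmapped codes visible rather than dropping them makes it easier to see which glyphs still need a mapping.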