Tags: java, pdf, itext, pdfbox

Apache PDFBox is not able to extract text, but PDFBox Debugger shows the text correctly?


I am not able to extract text from a PDF: the output contains garbage characters, but when I inspect the file with the PDFBox Debugger, it shows the correct text.

Here is the pdf link.

For example, when I extract page 9 using PDFTextStripper, I get output like the following:

hhhhhhhhhhhhhhhhhhhhhhhIhhIhhhhhhhhhIh]hhhhhIhhIIhhIIIhhhh...

I get the same result when I copy the text in Adobe Reader, or extract it with other libraries such as iText and MuPDF.

However, when I open the file in PDFBox Debugger, the text is shown correctly (highlighted in red):

[Screenshot: PDFBox Debugger view of the page content stream, with the text highlighted in red]

I am not familiar with the PDF stream syntax shown above, but if the PDFBox Debugger shows the correct text, why can't PDFBox extract it correctly?

Also, the font used is an old Gurmukhi script font whose glyphs are mostly drawn in the ASCII range, which is why the text appears as English letters.


Solution

  • The issue was an incorrect mapping of character codes to Unicode by the font...

    However, using the character codes directly in processTextPosition works for me:

    @Override
    protected void processTextPosition(TextPosition text)
    {
        super.processTextPosition(text);

        // The following gives the incorrect mapping:
        // String unicode = text.getUnicode();

        // Use the raw character codes instead, treating each one as a code point.
        int[] cc = text.getCharacterCodes();
        String unicode = new String(cc, 0, cc.length);
    }
    

    See the official example for more on using processTextPosition.
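The conversion at the heart of this answer can be tried without PDFBox. The sketch below is only an illustration: the class `CharCodeDecoder` and its methods are hypothetical names, and the sample codes are made up rather than taken from the PDF in the question. It relies on the standard `String(int[] codePoints, int offset, int count)` constructor, which interprets each int as a Unicode code point, and assumes every character code is a valid code point (the constructor throws `IllegalArgumentException` otherwise).

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns raw character codes into text.
class CharCodeDecoder {

    // Decodes the character codes of one glyph, treating each code as a code point.
    static String decode(int[] codes) {
        return new String(codes, 0, codes.length);
    }

    // Concatenates the decoded codes of several glyphs in the order given.
    static String decodeAll(List<int[]> glyphCodes) {
        StringBuilder sb = new StringBuilder();
        for (int[] codes : glyphCodes) {
            sb.append(decode(codes));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample codes in the ASCII range, one array per glyph.
        List<int[]> glyphs = List.of(new int[] {72}, new int[] {105});
        System.out.println(decodeAll(glyphs)); // prints "Hi"
    }
}
```

In a PDFTextStripper subclass, the array passed to `decode` would presumably be the result of `text.getCharacterCodes()` for each TextPosition, with the pieces collected in your own buffer. For a legacy font like the one described in the question, the result is the ASCII text the font was authored with, not Unicode Gurmukhi.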