java android-studio pdfbox text-extraction

PDFBox text extraction ligatures "fi", "fl" problem on Android Studio


I'm using the PdfBox-Android library (https://github.com/TomRoush/PdfBox-Android), an Android port of PDFBox, in Android Studio to extract text from a PDF document. Here's what I'm doing:

File pdf_file = new File(file_path);

to create the file, then

PDDocument document = null;
document = PDDocument.load(pdf_file);

to load the file into a PDDocument object, and then

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(...);
pdfStripper.setEndPage(...);
String page_text = pdfStripper.getText(document);

to get the text content of the page. The issue is that when there's for example the word "firm" it displays it like "fi rm". It basically puts a space after "fi" (and, I guess, after "fl" and other ligatures). I tried reading the question "Problems with extracting OpenTypeFont text using pdfBox", but I don't understand how to fix it; it gives no solution details.

Important: As it turns out, my PDF file doesn't seem to contain ligature characters such as "ﬁ" (U+FB01), only the regular letters "fi", and yet there's a space after them. I don't see how to solve this.

PDF file: https://wetransfer.com/downloads/09e9036dda4a7962ccad32b1cbcd8edc20200506050349/ab4752


Solution

  • The issue is that when there's for example the word "firm" it displays it like "fi rm".

    The reason is simple: There is a space after the "fi"!

    This is the text-showing instruction that draws the line containing the first occurrence of "firm" in your sample file:

     [( )360.3(Mr Dursley was the director of a “)250( )110.3(rm called Grunnings, )]TJ
    

    The byte 147 (0x93, which appears above as “ because that is its Windows-1252 rendering) is mapped by the font encoding to the glyph name fi, and by the font's ToUnicode map to the Unicode character U+FB01, the Latin small ligature fi.

    Thus, PDF viewers display the ligature glyph, and text extractors extract either the Unicode ligature character or, after expansion, the characters f and i.

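    After that ligature, the start point for drawing the next glyph is moved left by 250 units (numbers in a TJ array are thousandths of a text space unit and are subtracted from the current position), then a space is drawn, then the next start point is moved left by 110.3 units, and then "rm" is drawn.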

    Thus, you don't see a gap between "fi" and "rm" in viewers (because the moves left counteract the drawing of the space glyph) but text extractors extract a space character (because it's there).
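
The arithmetic behind this can be sketched in a few lines of Java. The class and method names are hypothetical, and the width of the space glyph (250 units) is an assumed value for illustration only; the real width comes from the font's width table in the sample file. The sketch adds the glyph widths and subtracts the TJ numbers that appear between the ligature and "rm":

```java
// Sketch only: net pen movement between the "fi" ligature and "rm".
// The space width is an ASSUMED value, not read from the sample PDF.
class TjOffsetSketch {

    /**
     * Returns the net horizontal movement in thousandths of a text space unit.
     * Each shown glyph moves the pen right by its width; each TJ number is
     * subtracted, so a positive number moves the pen left.
     */
    static double netOffset(double[] glyphWidths, double[] tjAdjustments) {
        double x = 0;
        for (double width : glyphWidths) {
            x += width;
        }
        for (double adjustment : tjAdjustments) {
            x -= adjustment;
        }
        return x;
    }

    public static void main(String[] args) {
        double assumedSpaceWidth = 250;          // hypothetical width of the space glyph
        double[] glyphs = { assumedSpaceWidth };  // the single space that is drawn
        double[] adjustments = { 250, 110.3 };    // the two TJ numbers from the instruction
        // A result at or below zero means the space opens no visible gap.
        System.out.println(netOffset(glyphs, adjustments));
    }
}
```

Under that assumed width the result is negative, which matches the observation that viewers show no gap even though a space glyph is drawn.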

    You can check that this is not a PDFBox quirk: Adobe Reader, for example, extracts that text line via copy and paste as

    Mr Dursley was the director of a fi rm called Grunnings,
    

    Just like PDFBox, it expands the ligature and extracts the space character.
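
If the goal is readable text rather than a faithful copy of what the content stream contains, one option is to post-process the string returned by getText. The following is a minimal sketch, not part of PDFBox: the class LigatureCleanup and its methods are hypothetical names. It expands Latin ligature characters with java.text.Normalizer and then applies a heuristic that drops a single space after "fi" or "fl" when a lowercase letter follows. That heuristic is an assumption that fits this sample file; it would also join genuine word pairs such as "wifi signal", so check it against your own documents before relying on it.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: cleans up already extracted text.
class LigatureCleanup {

    /** Expands Latin ligature characters (U+FB00..U+FB06), e.g. U+FB01 becomes "fi". */
    static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKC applies the compatibility decomposition of the ligature.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Heuristic (ASSUMPTION): removes one space that directly follows "fi" or "fl"
     * and directly precedes a lowercase letter. May produce false positives.
     */
    static String dropSpaceAfterLigature(String text) {
        return text.replaceAll("(?<=f[il]) (?=\\p{Ll})", "");
    }

    public static void main(String[] args) {
        // The line from the sample file, with the ligature still unexpanded.
        String extracted = "Mr Dursley was the director of a \uFB01 rm called Grunnings,";
        System.out.println(dropSpaceAfterLigature(expandLigatures(extracted)));
    }
}
```

Expanding the ligatures first means the same heuristic works whether the extractor returned U+FB01 or the two letters f and i.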