java android-studio pdfbox text-extraction

PDFBox text extraction: problem with the ligatures "fi" and "fl" in Android Studio

I'm using the PdfBox-Android library (https://github.com/TomRoush/PdfBox-Android) in Android Studio to extract text from a PDF document. Here's what I'm doing:

File pdf_file = new File(file_path);

to create the file, then

PDDocument document = PDDocument.load(pdf_file);

to load the file into a PDDocument object, and then

PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(...);
pdfStripper.setEndPage(...);
String page_text = pdfStripper.getText(document);

to get the text content of the page. The issue is that a word such as "firm" comes out as "fi rm": a space is inserted after "fi" (and, I guess, after "fl" and other ligatures). I read the question "Problems with extracting OpenTypeFont text using pdfBox", but it gives no solution details and I don't understand how to fix this.

Important: As it turns out, my PDF file doesn't seem to contain any ligature characters such as ﬁ, only the regular letters fi, and yet there's a space after them. I can't see a solution.

PDF file: https://wetransfer.com/downloads/09e9036dda4a7962ccad32b1cbcd8edc20200506050349/ab4752

Solution

The issue is that when there's for example the word "firm" it displays it like "fi rm".

The reason is simple: There is a space after the "fi"!

This is the text drawing instruction that draws the line with the first occurrence of "firm" in your sample file:

 [( )360.3(Mr Dursley was the director of a “)250( )110.3(rm called Grunnings, )]TJ

The byte “ (147) is mapped by the font encoding to the glyph name fi, and by the ToUnicode map of the font to the Unicode character U+FB01, the Latin small ligature fi.

Thus, PDF viewers display the ligature glyph ﬁ, and text extractors extract either the Unicode ligature character ﬁ or, after expansion, the characters f and i.
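
To illustrate what "expansion" means here, the following is a minimal sketch that uses the standard java.text.Normalizer class, not PDFBox itself. The class and method names (LigatureExpansion, expandLigatures) are made up for this example. Compatibility normalization (NFKC) replaces U+FB01 with the two letters f and i; note that it also rewrites other compatibility characters, which may or may not be what you want.

```java
import java.text.Normalizer;

class LigatureExpansion {
    // Replaces compatibility characters (such as the ligature U+FB01)
    // with their plain equivalents using Unicode NFKC normalization.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "\uFB01rm"; // ligature fi followed by "rm"
        System.out.println(expandLigatures(extracted)); // prints "firm"
    }
}
```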

After that ligature, the start point for drawing the next glyph is moved left by 250 units (numbers in a TJ array are thousandths of a text space unit, and positive values move left), then a space is drawn, then the next start point is moved left by 110.3 units, and then "rm" is drawn.

Thus, you don't see a gap between "fi" and "rm" in viewers (because the moves left counteract the drawing of the space glyph) but text extractors extract a space character (because it's there).
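
The arithmetic behind this can be sketched as follows. The width of the space glyph is an assumption: the font's width table is not shown here, and 360.3 is only inferred from the fact that 250 + 110.3 = 360.3 and that the line starts with ( )360.3. The class and method names are made up for this example, and strings are restricted to spaces to keep it short.

```java
import java.util.Arrays;
import java.util.List;

class TjAdvance {
    // ASSUMPTION: advance width of the space glyph, in thousandths of a
    // text space unit. The real value comes from the font's width table.
    static final double ASSUMED_SPACE_WIDTH = 360.3;

    // Net horizontal displacement produced by a run of TJ array elements.
    // A number is subtracted from the position (positive = move left);
    // a string advances by its glyph widths (only spaces in this sketch).
    static double netAdvance(List<Object> elements) {
        double x = 0;
        for (Object element : elements) {
            if (element instanceof Number) {
                x -= ((Number) element).doubleValue();
            } else {
                x += element.toString().length() * ASSUMED_SPACE_WIDTH;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // The elements between the ligature and "rm": 250 ( ) 110.3
        List<Object> betweenFiAndRm = Arrays.<Object>asList(250.0, " ", 110.3);
        // Under the assumed width, the result is zero up to rounding error.
        System.out.println(netAdvance(betweenFiAndRm));
    }
}
```

Under that assumption, "rm" starts exactly where the ligature ends, which is why no gap is visible even though a space glyph is drawn.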

You can check that this is not a PDFBox quirk: for example, copying and pasting from Adobe Reader extracts that text line as

Mr Dursley was the director of a fi rm called Grunnings,

Just like PDFBox, it expands the ligature and extracts the space character.
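
Since the space really is in the content stream, removing it is a post-processing decision on your side. The following is only a sketch of one possible heuristic, not something PDFBox does for you: drop a space when the glyph after it starts no further right than the glyph before it ends, i.e. when the space leaves no visible gap. The Glyph class is a hypothetical stand-in (PDFBox's TextPosition carries comparable position and width data), and all coordinates in the example are invented.

```java
import java.util.ArrayList;
import java.util.List;

class SpaceFilter {
    // Hypothetical positioned glyph (or glyph run) on one text line.
    static final class Glyph {
        final String text;
        final double x;     // start position on the baseline
        final double width; // advance width

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        double end() {
            return x + width;
        }
    }

    // Builds the line text, skipping every space whose neighbours leave
    // no visible gap (next start minus previous end <= tolerance).
    static String textWithoutZeroGapSpaces(List<Glyph> glyphs, double tolerance) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.size(); i++) {
            Glyph glyph = glyphs.get(i);
            boolean innerSpace = glyph.text.equals(" ") && i > 0 && i < glyphs.size() - 1;
            if (innerSpace) {
                double gap = glyphs.get(i + 1).x - glyphs.get(i - 1).end();
                if (gap <= tolerance) {
                    continue; // the space is visually cancelled out
                }
            }
            out.append(glyph.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("fi", 0, 556));        // ligature, expanded
        line.add(new Glyph(" ", 306, 360.3));     // space drawn after moving left 250
        line.add(new Glyph("rm", 556, 1000));     // starts where the ligature ends
        line.add(new Glyph(" ", 1556, 360.3));    // ordinary word space
        line.add(new Glyph("called", 1916.3, 2500));
        System.out.println(textWithoutZeroGapSpaces(line, 1.0)); // prints "firm called"
    }
}
```

The tolerance has to be tuned to the units you work in, and a heuristic like this can misfire on tightly set text, so treat it as a starting point only.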