java encoding itext right-to-left

Reversed Hebrew or numbers after using iText to parse a PDF document


I'm working with iText 5 to parse a PDF written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I couldn't find a way to change the encoding in the library, and the text comes out as gibberish.

I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all possible charsets using Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.
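A minimal sketch of that re-decoding step, using the ISO-8859-1 / windows-1255 pair from the example in this question. It assumes the extractor mapped each original byte to the Latin-1 character with the same value, which is why re-encoding as ISO-8859-1 recovers the bytes unchanged. The class and method names (Redecode, latin1ToHebrew) are made up for illustration, and windows-1255 is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Redecode {
    // Hypothetical helper: turns Latin-1 "mojibake" back into Hebrew.
    static String latin1ToHebrew(String mojibake) {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
        // so this recovers the raw bytes without loss (assumption: the input
        // contains only characters in that range).
        byte[] raw = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        // Reinterpret the same bytes as windows-1255 (Hebrew).
        return new String(raw, Charset.forName("windows-1255"));
    }

    public static void main(String[] args) {
        // Latin-1 form of the second word in the example line, as escapes.
        String word = "\u00FA\u00E5\u00E1\u00E9\u00E9\u00E7\u00FA\u00E4";
        System.out.println(latin1ToHebrew(word));
    }
}
```

This only repairs the character mapping. The letters still come back in visual (display) order, which is the reversal described next.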

I thought I could reverse the text line by line, but Hebrew is written right to left while numbers and English are written left to right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.

Example:

PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ

new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס

If I reverse this, I get סה"כ ניכויי התחייבות 55.78

The number should be 87.55 and not 55.78

The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
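For reference, that manual workaround can be sketched roughly as below: reverse the whole line, then reverse each number or Latin token back. The class name, method name, and regular expression are illustrative assumptions. The sketch only handles single tokens such as 87.55 and will put multi-word English phrases in the wrong word order, which is part of why this approach gets fiddly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ManualLineFixer {
    // A left-to-right token: Latin letters/digits, optionally joined by . , : / -
    private static final Pattern LTR_RUN =
        Pattern.compile("[\\p{IsLatin}\\p{Nd}]+(?:[.,:/-][\\p{IsLatin}\\p{Nd}]+)*");

    // Hypothetical helper: visual-order line in, approximate logical order out.
    static String visualToLogical(String visualLine) {
        // Step 1: reverse everything, which fixes the Hebrew but flips the numbers.
        String reversed = new StringBuilder(visualLine).reverse().toString();
        // Step 2: flip each number/Latin token back to its original direction.
        Matcher m = LTR_RUN.matcher(reversed);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(reversed, last, m.start());
            out.append(new StringBuilder(m.group()).reverse());
            last = m.end();
        }
        out.append(reversed, last, reversed.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // "87.55" followed by a two-letter Hebrew word stored in visual order.
        System.out.println(visualToLogical("87.55 \u05D1\u05D0"));
    }
}
```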

Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL


Solution

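  • Using ICU (the Bidi class from ICU4J) did the job:

    import com.ibm.icu.text.Bidi;

    // input is assumed to be one line of text after the charset has been fixed
    Bidi bidi = new Bidi();
    bidi.setPara(input, Bidi.RTL, null);   // treat the line as a right-to-left paragraph
    String output = bidi.writeReordered(Bidi.DO_MIRRORING);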
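If adding ICU4J is not an option, a similar reordering can probably be done with the JDK's own java.text.Bidi. This is an alternative sketch, not the method used in the answer above: the class and method names are made up, it does not mirror brackets the way Bidi.DO_MIRRORING does, and it works on individual UTF-16 code units, so it assumes the text has no surrogate pairs.

```java
import java.text.Bidi;

final class JdkBidiReorder {
    // Hypothetical helper: applies the bidi reordering step to one line,
    // treating it as a right-to-left paragraph.
    static String reorder(String line) {
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_RIGHT_TO_LEFT);
        int n = line.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i); // resolved embedding level per char
            chars[i] = line.charAt(i);
        }
        // Reorders the chars array in place according to the levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder out = new StringBuilder(n);
        for (Character c : chars) {
            out.append(c.charValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "87.55" followed by a two-letter Hebrew word stored in visual order.
        System.out.println(reorder("87.55 \u05D1\u05D0"));
    }
}
```

As with the ICU version, the number keeps its digit order because digits get a higher embedding level than the surrounding Hebrew and are therefore reversed twice.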