I'm trying to extract text from a rectangle with iTextSharp. It works fine for almost all sections of the document, except for some specific areas. These areas are simple bold uppercase titles and simple content in a slightly smaller font than the rest of the document (both uppercase). In these areas I get an anagram of the selected text instead of the correct words.
For example, the word "RELEASE" is read as "ERLEASE", "VOYAGE" becomes "EGAYVO", and the sentence "FURTHER CHARGES" becomes "FHTRU E R CHAGR E S".
The odd thing is that if I extract the full page with a SimpleTextExtractionStrategy, I obtain the correct text.
The PDF's font is plain Arial, and the strategy I used for the extraction is taken from StackOverflow (rect is passed in as an argument):
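' At the top of the file:
' Imports iTextSharp.text.pdf
' Imports iTextSharp.text.pdf.parser
_pdfRd = New PdfReader(_pdfPath)
Dim output As String()
Dim nrPag As Integer = 1
' Keep only the text chunks that the region filter accepts for rect
Dim filter As RenderFilter = New RegionTextRenderFilter(rect)
Dim locStrategy As New LocationTextExtractionStrategy
' Declared as FilteredTextRenderListener so it can be passed as an extraction strategy
Dim strategy As New FilteredTextRenderListener(locStrategy, filter)
output = PdfTextExtractor.GetTextFromPage(_pdfRd, nrPag, strategy).Split(vbLf)
_pdfRd.Close()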
I tried with other documents and it works very well; I'm not able to reproduce this issue with different documents.
I was worried that my code was the problem, so I also tried this strategy: http://www.schiffhauer.com/read-text-in-a-pdf-in-c-with-itextsharp/ but the result is the same.
Am I missing something in the reading process, or is this a problem with my PDF?
UPDATE: If I select a single letter of a faulty word, the output is an empty string. The same happens if I select several letters together; I get (anagram) output only if I select the whole word. It's really odd. For example, with the words "CARGO RELEASE", if I select only "GO" or any other substring with the rectangle, I get nothing, but if I select "CARGO" I obtain "GRACO ERLESAE", even though I selected only the first word and not the second.
Have you tried customizing the working SimpleTextExtractionStrategy so that it processes only the rectangle rather than the full page?
You can find the full code in the GitHub project here: https://github.com/itext/itextsharp/blob/75f05dd7d87797b86c44649f5f96df2d90d730e8/src/extras/itextsharp.tests/iTextSharp/text/pdf/parser/SimpleTextExtractionStrategyTest.cs
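One way to try this without writing a custom strategy might be to wrap SimpleTextExtractionStrategy in the same region filter used in the question. This is only a rough sketch, assuming iTextSharp 5.x; the module and function names (RegionExtractionSketch, ExtractRegionText) are made up for illustration, and I have not tested it against the problematic PDF:

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtractionSketch
    ' Hypothetical helper: returns the text that the region filter lets through,
    ' in the order it appears in the content stream (SimpleTextExtractionStrategy
    ' does not re-sort chunks by position).
    Function ExtractRegionText(pdfPath As String,
                               pageNumber As Integer,
                               rect As iTextSharp.text.Rectangle) As String
        Dim reader As New PdfReader(pdfPath)
        Try
            Dim filter As RenderFilter = New RegionTextRenderFilter(rect)
            ' Same filter as in the question, but delegating to the simple strategy
            Dim strategy As New FilteredTextRenderListener(
                New SimpleTextExtractionStrategy(), filter)
            Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
        Finally
            reader.Close()
        End Try
    End Function
End Module
```

As far as I know, the region filter accepts or rejects whole text chunks rather than individual glyphs, so a rectangle may still return more (or less) text than it visually covers. If the full-page result with SimpleTextExtractionStrategy is correct, this variant should at least show whether the reordering comes from LocationTextExtractionStrategy or from the filter.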