I am extracting text from a PDF and have an issue with the same text being returned for sequential pages. I have written a few PDF parsers using iTextSharp and have just ported the following code from iTextSharp to iText7, on the flawed assumption that this was only an iTextSharp issue:
var pdfDocument = new PdfDocument(new PdfReader(@"C:\Temp\MyForm.pdf"));
for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
{
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDocument.GetPage(page);
    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, strategy);
    // Process this page
    Console.WriteLine("PAGE {0}", page);
    Console.WriteLine(currentText);
}
Is there something I'm missing here?
Actually, it is not the same text being returned for sequential pages. Instead, for each page you get the text of that page together with the text of all preceding pages.
Often this happens in code that re-uses a single text extraction strategy for multiple pages. But that's not the case in your code; you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.
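For contrast, here is a minimal sketch of the strategy re-use pitfall mentioned above, which is not what your code does. The path "input.pdf" and the class name are placeholders of mine, not taken from the question.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class StrategyReusePitfall
{
    static void Main()
    {
        // "input.pdf" is a placeholder path, not a file from the question.
        var pdfDocument = new PdfDocument(new PdfReader("input.pdf"));

        // PITFALL: a single strategy instance is shared by all pages.
        var sharedStrategy = new SimpleTextExtractionStrategy();

        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            // The strategy keeps appending to its internal buffer, so the
            // result for page N also contains the text of pages 1..N-1.
            var text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), sharedStrategy);
            Console.WriteLine("PAGE {0}", page);
            Console.WriteLine(text);
        }

        pdfDocument.Close();
    }
}
```

Moving the `new SimpleTextExtractionStrategy()` call inside the loop, as in your code, avoids this accumulation.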
And indeed, each page of your document also contains the contents of all previous pages, merely positioned outside its crop box. To extract only the text inside the respective page's crop box, you have to filter the text by that region, e.g. like this:
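// Namespaces needed by this snippet (iText 7 for .NET):
// using System;
// using iText.Kernel.Pdf;
// using iText.Kernel.Pdf.Canvas.Parser;
// using iText.Kernel.Pdf.Canvas.Parser.Filter;
// using iText.Kernel.Pdf.Canvas.Parser.Listener;
string SRC = @"285187.pdf";
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Console.WriteLine("\n285187 Filtered\n============\n");
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A fresh strategy per page, so no text carries over between pages
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);
    // Only let through text located in this page's crop box
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);
    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);
    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}
pdfDoc.Close();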
It is unclear whether the PDF has been created like this by design or by error.
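To check whether some other file is affected in the same way, one option is to compare the amount of text extracted with and without the crop box filter for each page. The following is only a rough sketch: "input.pdf" and the class name are placeholders of mine, and a difference in length merely hints at content outside the crop box rather than proving how the file was built.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class CropBoxCheck
{
    static void Main()
    {
        // "input.pdf" is a placeholder path.
        var pdfDoc = new PdfDocument(new PdfReader("input.pdf"));

        for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
        {
            var pdfPage = pdfDoc.GetPage(i);

            // Everything the page content stream draws, visible or not.
            var allText = PdfTextExtractor.GetTextFromPage(
                pdfPage, new SimpleTextExtractionStrategy());

            // Only the text located in the page's crop box.
            var filtered = new FilteredTextEventListener(
                new SimpleTextExtractionStrategy(),
                new TextRegionEventFilter(pdfPage.GetCropBox()));
            var visibleText = PdfTextExtractor.GetTextFromPage(pdfPage, filtered);

            Console.WriteLine(
                "PAGE {0}: unfiltered {1} chars, inside crop box {2} chars",
                i, allText.Length, visibleText.Length);
        }

        pdfDoc.Close();
    }
}
```

If the unfiltered count grows from page to page while the filtered count does not, the file likely has the same layout as the one discussed here.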