Extracting text using iText7 throws exception


I am extracting text from a PDF file using iText7 8.0.4 with the following code:

MemoryStream pdfStream = ...
pdfStream.Position = 0;
var reader = new PdfReader(pdfStream);
using var pdfDocument = new PdfDocument(reader);
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    // Use a fresh strategy per page; a shared LocationTextExtractionStrategy
    // accumulates the text of every page it has processed so far.
    var strategy = new LocationTextExtractionStrategy();
    var text = PdfTextExtractor.GetTextFromPage(page, strategy);
}

This throws the exception "Error at file pointer 39747":

     iText.IO.Exceptions.IOException: Error at file pointer 39747.
      ---> iText.IO.Exceptions.IOException: '>' not expected.
        --- End of inner exception stack trace ---
        at iText.IO.Source.PdfTokenizer.ThrowError(String error, Object[] messageParams)
        at iText.IO.Source.PdfTokenizer.NextToken()
        at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.NextValidToken()
        at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.ReadObject()
        at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.Parse(IList`1 ls)
        at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
        at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessPageContent(PdfPage page)
        at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
        at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy)

How can I extract text from this PDF? The PDF is available at

https://wetransfer.com/downloads/0a2cb3abb921863f1f7aa109fd58e90e20240522072618/3e3501


Solution

  • This issue is caused by an inline image in the PDF page content stream. Its image data contains a byte sequence that matches the EI end-of-inline-image marker, so iText prematurely treats that position as the end of the inline image. It then attempts to interpret the remaining image data as content stream instructions, which fails with the observed exception.

    There are two ways to deal with this, and both require some work.

    Improve the iText inline image parsing

    Beware: as this approach requires changes in private and/or sealed code, you either have to patch the iText sources and compile your own distribution, or you have to copy substantial parts of the iText content stream parsing framework into your own namespaces, improve your copy, and use your code for stream parsing from now on.

    The main class of interest here is InlineImageParsingUtils in the iText.Kernel.Pdf.Canvas.Parser.Util namespace. There are two major code locations here to improve:

    1. The ParseSamples method recognizes the end of the inline image data primarily by finding "EI" followed by a white space. Before June 2018, though, it used to look for "EI" surrounded by white spaces.

      You may consider rolling back that change and again require the leading white space. The change has been introduced in the commit 0e44a96b2f3b90fb6656310d2c0f5615b05d4391 (which is the autoport from the iText/Java commit e0df12db9cd8869928bbcad7e038f3a5d1aef71c).

      Consider, though, that this commit has the title "Fix processing the end of an inline image." and refers to an issue DEVSIX-1914, so quite likely PDFs without a white space before the "EI" marker have been encountered by iText customers resulting in this change.

    2. There is an additional test of the found "EI": the method InlineImageStreamBytesAreComplete is called to check whether the identified stream appears complete, and if this method returns false, ParseSamples continues its search for the correct "EI".

      This method checks the stream by applying all declared filters (compression, binary-to-text, ...) and returns false if an exception occurred. Unfortunately, the filter implementations used are mostly built to be error tolerant.

      You may consider improving this by creating and applying less error tolerant filter implementations here to increase the likelihood that cut-off streams are recognized. iText already uses a stricter Flate filter here; you can similarly enforce the use of your alternative filter implementations. Furthermore, you can analyze the data you retrieve after applying the filters: Depending on the exact image type (also implied by the filters) you can validate whether the stream represents a complete image.
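    To illustrate the difference between the two "EI" recognition rules from item 1, here is a minimal sketch. It is not the iText code; the class InlineImageEndFinder and its method are hypothetical names, and treating the end of the data as an acceptable trailing delimiter is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iText: lists the offsets at which an
// "EI" end marker could be recognized under a lenient or a strict rule.
public static class InlineImageEndFinder
{
    // White-space characters of PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static bool IsPdfWhitespace(byte b) =>
        b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;

    // requireLeadingWhitespace = false: "EI" followed by white space.
    // requireLeadingWhitespace = true:  "EI" surrounded by white space.
    public static List<int> FindCandidates(byte[] data, bool requireLeadingWhitespace)
    {
        var hits = new List<int>();
        for (int i = 0; i + 1 < data.Length; i++)
        {
            if (data[i] != (byte)'E' || data[i + 1] != (byte)'I')
                continue;

            // Assumption of this sketch: end of data counts as a delimiter.
            bool trailingOk = i + 2 == data.Length || IsPdfWhitespace(data[i + 2]);
            bool leadingOk = !requireLeadingWhitespace
                             || (i > 0 && IsPdfWhitespace(data[i - 1]));

            if (trailingOk && leadingOk)
                hits.Add(i);
        }
        return hits;
    }
}
```

    With the lenient rule, an "EI" that happens to occur inside the image data and is followed by a white space is reported as a candidate; with the strict rule it is skipped unless a white space also precedes it. Neither rule is safe on its own, which is why the completeness check from item 2 matters.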

    Either of these improvements would have helped in the case of your example PDF: the "EI" that iText finds and assumes to be the end of the inline image is not preceded by a white space, and less error-tolerant ASCII85Decode and LZWDecode implementations would have identified the incomplete data, as would an analysis of the size of the returned data.
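    A size analysis of the decoded data can be sketched as follows. For raw sample data (after all filters have been applied), each image row is padded to a full byte, so the expected length follows from the width, height, bits per component, and number of color components declared in the inline image dictionary. The class and method names are hypothetical, and using "at least the expected length" as the completeness criterion is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper, not part of iText: checks whether decoded inline
// image data is long enough for the declared image dimensions.
public static class InlineImageSizeCheck
{
    // colorComponents: e.g. 1 for gray, 3 for RGB, 4 for CMYK.
    public static long ExpectedDecodedLength(
        int width, int height, int bitsPerComponent, int colorComponents)
    {
        if (width <= 0 || height <= 0 || bitsPerComponent <= 0 || colorComponents <= 0)
            throw new ArgumentOutOfRangeException(nameof(width), "All values must be positive.");

        long bitsPerRow = (long)width * colorComponents * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8; // each row is padded to a byte boundary
        return bytesPerRow * height;
    }

    // Assumption: data shorter than expected indicates a cut-off stream;
    // extra trailing bytes are tolerated.
    public static bool LooksComplete(
        byte[] decoded, int width, int height, int bitsPerComponent, int colorComponents) =>
        decoded.LongLength >= ExpectedDecodedLength(width, height, bitsPerComponent, colorComponents);
}
```

    For example, a 10 x 4 RGB image with 8 bits per component needs 120 bytes of sample data, so decoded data of 119 bytes or fewer would indicate that the "EI" was found too early. This check does not apply to filters that yield an encoded image format rather than raw samples, such as DCTDecode.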

    Sanitize content streams before parsing

    Alternatively, if you don't want to patch or copy the iText content stream parsing code, you can switch to a two-phase approach: In the first phase you can prepare the PDF document by loading all content streams, parsing them with your own code for inline image recognition, removing all the inline images you find, and storing these manipulated streams back into the document; in the second phase you use the regular iText parsing for text extraction.
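    The first phase could look roughly like the following sketch, which removes everything from a "BI" operator up to the matching "EI" from the bytes of one content stream. It is a deliberately naive heuristic under stated assumptions: InlineImageStripper is a hypothetical name, keywords are only accepted when surrounded by white space (or the data boundaries), and "BI" occurring inside a string literal or an "EI" preceded by a white space inside the image data would still fool it. Real code should validate each candidate, for instance by decoding the data and checking its size.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of iText: strips "BI ... ID ... EI" spans
// from content stream bytes using a strict, white-space delimited match.
public static class InlineImageStripper
{
    private static bool IsPdfWhitespace(byte b) =>
        b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;

    // True if the two-character keyword starts at pos and is delimited by
    // white space or by the start/end of the data.
    private static bool IsKeywordAt(byte[] d, int pos, char c0, char c1)
    {
        if (pos < 0 || pos + 1 >= d.Length) return false;
        if (d[pos] != (byte)c0 || d[pos + 1] != (byte)c1) return false;
        bool before = pos == 0 || IsPdfWhitespace(d[pos - 1]);
        bool after = pos + 2 == d.Length || IsPdfWhitespace(d[pos + 2]);
        return before && after;
    }

    private static int IndexOfKeyword(byte[] d, int start, char c0, char c1)
    {
        for (int i = start; i + 1 < d.Length; i++)
            if (IsKeywordAt(d, i, c0, c1)) return i;
        return -1;
    }

    public static byte[] Strip(byte[] content)
    {
        using var output = new MemoryStream(content.Length);
        int pos = 0;
        while (pos < content.Length)
        {
            int bi = IndexOfKeyword(content, pos, 'B', 'I');
            int id = bi < 0 ? -1 : IndexOfKeyword(content, bi + 2, 'I', 'D');
            // Image data starts after "ID" and one white-space byte.
            int ei = id < 0 ? -1 : IndexOfKeyword(content, id + 3, 'E', 'I');
            if (ei < 0)
            {
                // No further complete inline image: copy the rest unchanged.
                output.Write(content, pos, content.Length - pos);
                break;
            }
            output.Write(content, pos, bi - pos); // keep everything before "BI"
            pos = ei + 2;                         // resume right after "EI"
        }
        return output.ToArray();
    }
}
```

    To apply this with iText, you would open the document with both a PdfReader and a PdfWriter, read each page content stream with page.GetContentStream(i).GetBytes(), pass the bytes through Strip, and store the result with SetData on the same stream. After saving, the second phase is the regular PdfTextExtractor call on the sanitized document.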