
iText throws ClassCastException: PdfNumber cannot be cast to PdfLiteral


I am using iText v5.5.1 to read a PDF and extract the text from it:

PdfReader pdfReader = new PdfReader(new CloseShieldInputStream(is));
PdfReaderContentParser pdfParser = new PdfReaderContentParser(pdfReader);

int maxPageNumber = pdfReader.getNumberOfPages();
int pageNumber = 1;

StringBuilder sb = new StringBuilder();

while (pageNumber <= maxPageNumber) {
    // Use a fresh strategy per page: SimpleTextExtractionStrategy keeps
    // accumulating text, so reusing one instance would append earlier pages again.
    SimpleTextExtractionStrategy extractionStrategy =
            pdfParser.processContent(pageNumber, new SimpleTextExtractionStrategy());

    sb.append(extractionStrategy.getText());

    pageNumber++;
}

On one PDF file the following exception is thrown:

java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNumber cannot be cast to com.itextpdf.text.pdf.PdfLiteral
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:382)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:80)

That PDF file seems to be broken, but maybe its contents still make sense...


Solution

    Indeed, as you say: "That PDF file seems to be broken."

    The content streams of all pages look like this:

    /GS1 gs
    q
    595.00 0 0 
    

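    It looks like they are all cut off early, as the last line is not a complete operation: it consists of three operands with no operator following them. This can certainly make a parser hiccup, as it does iText. iText expects the last object of each parsed operation to be the operator, a PdfLiteral, but here it is a number, a PdfNumber, hence the ClassCastException.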
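To illustrate what "not a complete operation" means here, the following is a minimal sketch in plain Java, not iText code. It assumes a deliberately simplified view of a content stream as whitespace-separated tokens, and the class and method names (`ContentStreamTailCheck`, `endsWithDanglingOperands`) are made up for this example. It only checks whether a fragment ends in a numeric operand instead of an operator.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only: real content streams also contain strings, arrays,
// dictionaries and inline images, which this whitespace tokenizer does not handle.
class ContentStreamTailCheck {

    /** Splits a content stream fragment into whitespace-separated tokens. */
    static List<String> tokens(String content) {
        List<String> result = new ArrayList<>();
        for (String t : content.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** True if the token looks like a numeric operand, e.g. 595.00, 0 or -1. */
    static boolean isNumber(String token) {
        return token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** True if the fragment ends in a number, i.e. operands with no operator after them. */
    static boolean endsWithDanglingOperands(String content) {
        List<String> t = tokens(content);
        return !t.isEmpty() && isNumber(t.get(t.size() - 1));
    }

    public static void main(String[] args) {
        // The fragment found in every page of the broken file.
        String broken = "/GS1 gs\nq\n595.00 0 0 \n";
        System.out.println(endsWithDanglingOperands(broken));
    }
}
```

A check like this could only serve as a rough pre-flight test before handing a page to a real parser; it says nothing about why the stream ends early.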

    Furthermore, the content should be longer: even the compressed stream data of each page is somewhat larger than this decoded fragment. This indicates that the streams are broken at the byte level.

    Looking at the bytes of the PDF file one cannot help but notice that

    1. even inside binary streams the codes 13 and 10 only occur together and
    2. cross-reference offset values are less than the actual positions.
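The first observation can be checked mechanically. The sketch below is one possible way to do it in plain Java; the class and method names are invented for this example, and the file to inspect is passed as a command-line argument. Note that this is only a heuristic: in a file with sizeable compressed streams one would expect some lone 13 or 10 bytes, so their complete absence is suspicious, but a small or uncompressed PDF written with CR LF line endings could pass the check without being damaged.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class LineBreakScan {

    /**
     * True if every CR (13) is directly followed by LF (10)
     * and every LF is directly preceded by CR.
     */
    static boolean crAndLfOnlyOccurTogether(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 13 && (i + 1 >= data.length || data[i + 1] != 10)) {
                return false; // lone CR
            }
            if (data[i] == 10 && (i == 0 || data[i - 1] != 13)) {
                return false; // lone LF
            }
        }
        return true;
    }

    /** Counts the CR LF pairs in the data. */
    static int countCrLf(byte[] data) {
        int count = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 13 && data[i + 1] == 10) {
                count++;
                i++; // skip the LF of this pair
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("CR LF pairs: " + countCrLf(data));
        System.out.println("CR and LF only occur together: " + crAndLfOnlyOccurTogether(data));
    }
}
```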

    So I assume that this PDF has been transmitted using a transport method handling it as textual data, especially replacing any kind of assumed line break (CR or LF or CR LF) with the CR LF now omnipresent in the file (CR = Carriage Return = 13; LF = Line Feed = 10). Such replacements will automatically break any compressed data stream like the content streams in your file.
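The effect of such a replacement on compressed data can be reproduced with the JDK alone. The following sketch is not taken from the file in question: it deflates some arbitrary test data, rewrites every line break in the compressed bytes as CR LF (my assumption of what the transport did), and then tries to inflate the result. All names are hypothetical. The inserted bytes also make the data longer, which fits the second observation that the cross-reference offsets are smaller than the actual positions.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class TextModeDamageDemo {

    /** Rewrites every CR LF, lone CR and lone LF as CR LF, as a text-mode transfer might. */
    static byte[] normalizeLineBreaks(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == 13) {
                out.write(13);
                out.write(10);
                if (i + 1 < in.length && in[i + 1] == 10) {
                    i++; // an existing CR LF stays a single line break
                }
            } else if (in[i] == 10) {
                out.write(13);
                out.write(10);
            } else {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }

    /** Compresses the data with the default zlib settings. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates the data, or returns null if the stream is corrupt or ends early. */
    static byte[] inflateOrNull(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // truncated or otherwise unusable stream
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null;
        } finally {
            inflater.end();
        }
    }

    /** True only if the data can still be recovered after the line-break rewrite. */
    static boolean survivesTextModeTransfer(byte[] original) {
        byte[] damaged = normalizeLineBreaks(deflate(original));
        return Arrays.equals(original, inflateOrNull(damaged));
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        new Random(42).nextBytes(data); // arbitrary test data with a fixed seed
        System.out.println("survives: " + survivesTextModeTransfer(data));
    }
}
```

A single lone CR or LF in the compressed bytes is enough to shift everything after it, so in practice any stream of non-trivial size is affected.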

    Unfortunately, though, as for "but maybe its contents still make sense":

    Not much. There is one big image associated with each page. Considering the small size of the content streams and the large size of the images, I would assume that the PDF only contains scanned pages. But the images are also broken due to the replacements mentioned above.