
Why is iTextSharp reading pages 1..N instead of N?


Here's my code:

var sb = new StringBuilder();
var st = new SimpleTextExtractionStrategy();
string raw;
using(var r = new iTextSharp.text.pdf.PdfReader(path)) {
    for(int pn = 1; pn <= r.NumberOfPages; pn++) {
        raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
        sb.Append(raw);
    }
}

This works for almost all PDFs I've run across... until today:

http://www7.dleg.state.mi.us/orr/Files/AdminCode/356_10334_AdminCode.pdf

For this PDF (and others like it on the same site), the extracted text for page 1 is correct, but the text for page 2 contains pages 1 and 2, page 3 contains pages 1-3, etc. So my StringBuilder ends up with the text from pages 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, etc.

Using the default Location-based strategy has the same issue (and won't work for these particular PDFs anyway).

I recently upgraded from a much older version of iTextSharp (5.1-ish?) and didn't experience this issue before (I believe I've parsed some of these files before without issue). I poked through the source and didn't see anything obvious.

I thought I could work around this by requesting only the last page and taking the accumulated text, but that doesn't work: I get only the last page. If I hard-code the loop to get pages 2..4, I get 2, 2, 3, 2, 3, 4. So some state seems to carry over between calls to GetTextFromPage, perhaps data that PdfReader is maintaining.


Solution

  • Create the strategy inside the loop, so that each page gets its own instance:

    var sb = new StringBuilder();
    string raw;
    using(var r = new iTextSharp.text.pdf.PdfReader(path)) {
        for(int pn = 1; pn <= r.NumberOfPages; pn++) {
            // New strategy per page, so no text carries over from earlier pages.
            var st = new SimpleTextExtractionStrategy();
            raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
            sb.Append(raw);
        }
    }
    

    Update based on mkl's comment: a strategy object accumulates the content of every page it has been given. So if you want the text of one page with nothing already buffered, you have to pass a fresh strategy to each GetTextFromPage call.
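
To make the buffering behavior concrete, here is a minimal, self-contained sketch that needs no PDF and no iTextSharp reference. `AccumulatingStrategy` and `ExtractPage` are hypothetical stand-ins invented for this illustration; they are not iTextSharp's actual classes or implementation. They model only the one behavior described above: the strategy's buffer is never cleared between pages.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a text extraction strategy (NOT iTextSharp's real class).
// It models a single behavior: rendered text is appended and never cleared.
class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    // Called while a page is parsed; appends that page's text to the buffer.
    public void RenderText(string text) { buffer.Append(text); }

    // Returns everything rendered so far, not just the most recent page.
    public string GetResultantText() { return buffer.ToString(); }
}

static class Demo
{
    // Hypothetical analogue of GetTextFromPage: feed one page to the strategy,
    // then ask the strategy for its result.
    static string ExtractPage(string pageText, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pageText);
        return strategy.GetResultantText();
    }

    static void Main()
    {
        string[] pages = { "[page 1]", "[page 2]", "[page 3]" };

        // One shared strategy: each call returns all pages seen so far.
        var shared = new AccumulatingStrategy();
        var reused = new StringBuilder();
        foreach (var page in pages)
            reused.Append(ExtractPage(page, shared));

        // Fresh strategy per page: each call returns only that page.
        var fresh = new StringBuilder();
        foreach (var page in pages)
            fresh.Append(ExtractPage(page, new AccumulatingStrategy()));

        Console.WriteLine(reused); // [page 1][page 1][page 2][page 1][page 2][page 3]
        Console.WriteLine(fresh);  // [page 1][page 2][page 3]
    }
}
```

The shared-instance output follows the same 1, 1, 2, 1, 2, 3 pattern reported in the question, and starting the loop at a later page would shift the pattern the same way the hard-coded 2..4 test did, since the buffer only ever holds pages that were actually passed to that strategy instance.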