Here's my code:
var sb = new StringBuilder();
var st = new SimpleTextExtractionStrategy();
string raw;
using (var r = new iTextSharp.text.pdf.PdfReader(path)) {
    for (int pn = 1; pn <= r.NumberOfPages; pn++) {
        raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
        sb.Append(raw);
    }
}
This works for almost all PDFs I've run across... until today:
http://www7.dleg.state.mi.us/orr/Files/AdminCode/356_10334_AdminCode.pdf
For this PDF (and others like it on the same site), the extracted text for page 1 is correct, but the text for page 2 contains pages 1 and 2, page 3 contains pages 1-3, etc. So my StringBuilder ends up with the text from pages 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, etc.
Using the default Location-based strategy has the same issue (and won't work for these particular PDFs anyway).
I recently upgraded from a much older version of iTextSharp (5.1-ish?) and didn't experience this issue before (I believe I've parsed some of these files before without issue). I poked through the source and didn't see anything obvious.
I thought I could work around this by asking for only the last page, but this doesn't work -- I get only the last page. If I hard-code the loop to get pages 2..4, I get 2, 2, 3, 2, 3, 4. So the issue may be some sort of data that PdfReader is maintaining between calls to GetTextFromPage.
Change your code to something like this:
var sb = new StringBuilder();
string raw;
using (var r = new iTextSharp.text.pdf.PdfReader(path)) {
    for (int pn = 1; pn <= r.NumberOfPages; pn++) {
        // Create a new strategy for every page so nothing from earlier pages is buffered.
        var st = new SimpleTextExtractionStrategy();
        raw = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, pn, st);
        sb.Append(raw);
    }
}
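Update based on mkl's comment: a strategy remembers all page content it has been confronted with. The state carried between calls to GetTextFromPage therefore lives in the strategy object, not in PdfReader, so you have to use a fresh strategy for each page if you want an extraction with nothing buffered yet.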
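To make the buffering concrete, here is a toy model that does not use iTextSharp at all. ToyStrategy, Demo, ExtractAll and this GetTextFromPage are hypothetical stand-ins written for illustration only; they assume nothing more than what the comment above describes, namely that a strategy keeps appending to one internal buffer and returns the whole buffer each time.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for an extraction strategy (not the iTextSharp class):
// it appends every chunk it is shown to one buffer and never clears it.
class ToyStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();
    public void RenderText(string chunk) { buffer.Append(chunk); }
    public string GetResultantText() { return buffer.ToString(); }
}

static class Demo
{
    // Hypothetical stand-in for GetTextFromPage: feed one page to the
    // strategy, then return everything the strategy currently holds.
    static string GetTextFromPage(string[] pages, int pn, ToyStrategy st)
    {
        st.RenderText(pages[pn - 1]);
        return st.GetResultantText();
    }

    // Concatenate the per-page results, either reusing one strategy for
    // all pages or creating a fresh one for each page.
    public static string ExtractAll(string[] pages, bool reuseStrategy)
    {
        var sb = new StringBuilder();
        var shared = new ToyStrategy();
        for (int pn = 1; pn <= pages.Length; pn++)
        {
            var st = reuseStrategy ? shared : new ToyStrategy();
            sb.Append(GetTextFromPage(pages, pn, st));
        }
        return sb.ToString();
    }

    static void Main()
    {
        var pages = new[] { "1", "2", "3" };
        Console.WriteLine(ExtractAll(pages, true));  // 112123
        Console.WriteLine(ExtractAll(pages, false)); // 123
    }
}
```

With the shared strategy the output follows the 1, 1, 2, 1, 2, 3 pattern from the question; with a fresh strategy per page each page appears once.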