I have code that is meant to extract text from a user-defined rectangle on a PDF.
I am using iTextSharp for this.
The user enters the coordinates of the rectangle and can then either 'preview' it, which draws a red rectangle over their PDF, or 'generate' a new PDF, which is meant to capture the text within that rectangle and append an extra page containing just that text.
My issue is that the text is being captured from an area completely separate from the preview rectangle. Both rectangles are created in the same way:
//Preview rectangle code
var xfer = ConvertToPoint(Convert.ToDouble(ULTB.Text));
var yfer = ConvertToPoint(Convert.ToDouble(LLTB.Text));
var uxfer = ConvertToPoint(Convert.ToDouble(URTB.Text));
var uyfer = ConvertToPoint(Convert.ToDouble(LRTB.Text));
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle((float)xfer, (float)yfer, (float)uxfer, (float)uyfer);
This rectangle is then drawn onto the user document.
(ConvertToPoint just converts the user input from mm into points.)
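For reference, a conversion like this could look roughly as follows. This is a hypothetical stand-in rather than the actual helper (the class name UnitConversion is made up), assuming the input is in millimetres and using 1 point = 1/72 inch and 1 inch = 25.4 mm.

```csharp
public static class UnitConversion
{
    // PDF user space units: 72 points per inch.
    private const double PointsPerInch = 72.0;
    private const double MillimetresPerInch = 25.4;

    // Hypothetical equivalent of ConvertToPoint: millimetres in, points out.
    public static double ConvertToPoint(double millimetres)
    {
        return millimetres * PointsPerInch / MillimetresPerInch;
    }
}
```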
Using the exact same user input, the rectangle created by the following code is in a different location:
var xfer = ConvertToPoint(Convert.ToDouble(ULTB.Text));
var yfer = ConvertToPoint(Convert.ToDouble(LLTB.Text));
var uxfer = ConvertToPoint(Convert.ToDouble(URTB.Text));
var uyfer = ConvertToPoint(Convert.ToDouble(LRTB.Text));
RenderFilter[] filters = new RenderFilter[1];
LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
filters[0] = new RegionTextRenderFilter(new iTextSharp.text.Rectangle((float)xfer, (float)yfer, (float)uxfer, (float)uyfer));
FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);
String result = PdfTextExtractor.GetTextFromPage(reader, x, strategy);
The above code should extract the text at the user's coordinates, but it does not. Any ideas?
I've uploaded the PDF to Google Drive, with a lot of the text redacted.
The red rectangle is what I get via the preview code, and the text towards the bottom of the document is what's being picked up by the text capture.
The problem is that the page rotation property is not 0 here.
By default, iTextSharp has the "feature" of transforming the coordinates of content you add to a page so that they align with the page rotation. It does not apply the same transformation to the coordinates used during text extraction.
Fortunately, iTextSharp allows you to switch off that transformation: if you have a PdfStamper pdfStamper, simply set
pdfStamper.RotateContents = false;
right after initializing the stamper.
Of course this means that you have to take the page rotation into account in your code. But it also means that you can do so consistently.
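As a rough illustration of taking the rotation into account, the sketch below maps a rectangle given in the coordinates of the page as displayed (origin at the lower-left corner of the rotated page) into the coordinates of the unrotated page. The class and method names (RotationMapper, ToUnrotated) are hypothetical. It assumes the page box starts at (0, 0) and that the rotation is a multiple of 90. In iTextSharp the rotation and the unrotated size would presumably come from reader.GetPageRotation(page) and reader.GetPageSize(page).

```csharp
using System;

public static class RotationMapper
{
    // Maps a rectangle from displayed (rotated) page coordinates to
    // unrotated page coordinates. pageWidth and pageHeight are the
    // dimensions of the UNROTATED page. Assumes the page box origin is (0, 0).
    // Returns { llx, lly, urx, ury }.
    public static float[] ToUnrotated(float llx, float lly, float urx, float ury,
                                      int rotation, float pageWidth, float pageHeight)
    {
        float x1, y1, x2, y2;
        switch (((rotation % 360) + 360) % 360)
        {
            case 0:
                x1 = llx; y1 = lly; x2 = urx; y2 = ury;
                break;
            case 90:
                // Displayed (x, y) corresponds to unrotated (width - y, x).
                x1 = pageWidth - lly; y1 = llx;
                x2 = pageWidth - ury; y2 = urx;
                break;
            case 180:
                // Displayed (x, y) corresponds to unrotated (width - x, height - y).
                x1 = pageWidth - llx; y1 = pageHeight - lly;
                x2 = pageWidth - urx; y2 = pageHeight - ury;
                break;
            case 270:
                // Displayed (x, y) corresponds to unrotated (y, height - x).
                x1 = lly; y1 = pageHeight - llx;
                x2 = ury; y2 = pageHeight - urx;
                break;
            default:
                throw new ArgumentException("Rotation must be a multiple of 90.", "rotation");
        }
        // Normalise so the first corner is lower-left and the second is upper-right.
        return new[] { Math.Min(x1, x2), Math.Min(y1, y2), Math.Max(x1, x2), Math.Max(y1, y2) };
    }
}
```

The four returned values could then be passed to new iTextSharp.text.Rectangle(...) for the RegionTextRenderFilter, and, with RotateContents switched off, used for the preview rectangle as well, so that both operate in the same coordinate system.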