I am trying to find the text position in PDF page?
What I have tried is to get the text in the PDF page by PDF Text Extractor using simple text extraction strategy. I am looping each word to check if my word exists. split the words using:
var Words = pdftextextractor.Split(new char[] { ' ', '\n' });
What I wasn't able to do is to find the text position. The problem is I wasn't able to find the location of the text. All I need to find is the y co-ordinates of the word in the PDF file.
First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest.
Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.
Possible implementation:
how to:
ITextExtractionStrategy has the following method in its interface:
@Override
public void eventOccurred(IEventData data, EventType type) {
// you can first check the type of the event
if (!type.equals(EventType.RENDER_TEXT))
return;
// now it is safe to cast
TextRenderInfo renderInfo = (TextRenderInfo) data;
}
Important to keep in mind is that rendering instructions in a pdf do not need to appear in order.
The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:
render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"
You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are), and sorting (to get all the TextRenderInfo objects in the proper reading order.
Once that's done, it should be easy.