How to get the text position from the pdf page in iText 7


I am trying to find the position of a piece of text on a PDF page.

What I have tried so far is to extract the text of the PDF page with PdfTextExtractor, using the simple text extraction strategy. I then loop over the words to check whether my word exists, splitting the text with:

var Words = pdftextextractor.Split(new char[] { ' ', '\n' });

What I have not been able to do is find the location of the text. All I need is the y-coordinate of the word in the PDF file.


Solution

  • First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).

    Second, if you want the position, you are going to have to do a lot more work. ITextExtractionStrategy assumes you are only interested in the text.

    Possible implementation:

    • implement IEventListener
    • get notified for all events that render text, and store the corresponding TextRenderInfo object
    • once you're finished with the document, sort these objects based on their position in the page
    • loop over this list of TextRenderInfo objects, they offer both the text being rendered and the coordinates

    How to:

    1. implement ITextExtractionStrategy (or extend an existing implementation)
    2. use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
    3. your strategy should be set up to keep track of locations for the text it processed

    ITextExtractionStrategy has the following method in its interface:

    @Override
    public void eventOccurred(IEventData data, EventType type) {

        // you can first check the type of the event
        if (!type.equals(EventType.RENDER_TEXT))
            return;

        // now it is safe to cast
        TextRenderInfo renderInfo = (TextRenderInfo) data;
    }
    

    It is important to keep in mind that rendering instructions in a PDF do not need to appear in order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

    render "Ipsum Do"
    render "Lorem "
    render "lor Sit Amet"

    You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).
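
    The sorting and merging step could look roughly like the sketch below. It deliberately does not use iText: Chunk is a hypothetical holder for the values you would copy out of each stored TextRenderInfo (its text, the start of its baseline, and its width), and the two tolerance parameters are assumptions you would have to tune for your documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ChunkMergeSketch {

    /** Hypothetical stand-in for the data kept from one TextRenderInfo. */
    static final class Chunk {
        final String text;
        final float x;     // left end of the baseline
        final float y;     // height of the baseline (PDF y grows upwards)
        final float width; // horizontal extent of the rendered text

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Groups chunks into lines (top to bottom) and joins each line left to right. */
    static List<String> mergeIntoLines(List<Chunk> chunks, float yTolerance, float gapTolerance) {
        // Highest baseline first, so lines come out in top-to-bottom order.
        List<Chunk> byHeight = new ArrayList<>(chunks);
        byHeight.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        // Chunks whose baseline is close to the first chunk of the current line share that line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : byHeight) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= yTolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            float previousEnd = 0f;
            for (Chunk c : line) {
                // A visible gap between two chunks is treated as a missing space.
                if (sb.length() > 0 && c.x - previousEnd > gapTolerance) {
                    sb.append(' ');
                }
                sb.append(c.text);
                previousEnd = c.x + c.width;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same out-of-order rendering as in the example above; coordinates are made up.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Ipsum Do", 36f, 700f, 48f),
                new Chunk("Lorem ", 0f, 700f, 36f),
                new Chunk("lor Sit Amet", 84f, 700f, 72f));
        for (String line : mergeIntoLines(chunks, 2f, 1f)) {
            System.out.println(line);
        }
    }
}
```

    Grouping by baseline before sorting by x keeps chunks with slightly different y values on the same line, which a single combined comparator would not do reliably.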

    Once that's done, it should be easy.
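
    For the original question (the y-coordinate of a word), the last step might then be as small as the following sketch. Line is a hypothetical type holding a merged line of text together with the baseline height it was rendered at; how you build those lines is up to your strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordPositionSketch {

    /** Hypothetical merged line: its text plus the y-coordinate of its baseline. */
    static final class Line {
        final String text;
        final float y;

        Line(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    /** Returns the y-coordinate of every line that contains the word as a whole token. */
    static List<Float> findY(List<Line> lines, String word) {
        List<Float> ys = new ArrayList<>();
        for (Line line : lines) {
            for (String token : line.text.split("\\s+")) {
                if (token.equals(word)) {
                    ys.add(line.y);
                    break; // one hit per line is enough for the y-coordinate
                }
            }
        }
        return ys;
    }

    public static void main(String[] args) {
        // Made-up lines and coordinates, purely for illustration.
        List<Line> lines = Arrays.asList(
                new Line("Lorem Ipsum Dolor Sit Amet", 700f),
                new Line("Consectetur adipiscing", 680f),
                new Line("Dolor again", 660f));
        System.out.println(findY(lines, "Dolor"));
    }
}
```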