
How to search for a specific string or word and get its coordinates in a PDF document in Java?


I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I wanted: the output shows the coordinates of each individual character.

Here is the code:

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text : textPositions) {
        // Prints one line per character, which is why only character coordinates show up.
        System.out.println( "String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}

I am using PDFBox 2.0.


Solution

  • The last method in which PDFBox's PDFTextStripper class still has the text together with its positions (before it is reduced to plain text) is

    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    

    One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if sorting was requested in the first place).

    (Actually I would have preferred to intercept in the calling method writeLine, which, judging by the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word. Unfortunately, the PDFBox developers have declared this method private... well, maybe this will change by the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)

    Furthermore, it helps to wrap sequences of TextPosition instances in a String-like helper class to make the code clearer.

    With this in mind, one can search for the variables like this:

    // Returns every occurrence of searchTerm on the given page (pages are 1-based), with its positions.
    List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
    {
        final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
        if (searchTerm.isEmpty())
            return hits; // an empty term matches at every index, and the loop below would never end

        PDFTextStripper stripper = new PDFTextStripper()
        {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                TextPositionSequence word = new TextPositionSequence(textPositions);
                String string = word.toString();

                // Collect every occurrence in this chunk, overlapping ones included.
                int fromIndex = 0;
                int index;
                while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
                {
                    hits.add(word.subSequence(index, index + searchTerm.length()));
                    fromIndex = index + 1;
                }
                super.writeString(text, textPositions);
            }
        };

        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document); // the extracted text is discarded; only the hits matter
        return hits;
    }
    

    with this helper class

    public class TextPositionSequence implements CharSequence
    {
        public TextPositionSequence(List<TextPosition> textPositions)
        {
            this(textPositions, 0, textPositions.size());
        }
    
        public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
        {
            this.textPositions = textPositions;
            this.start = start;
            this.end = end;
        }
    
        @Override
        public int length()
        {
            return end - start;
        }
    
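        @Override
        public char charAt(int index)
        {
            // Takes only the first char of each TextPosition; a glyph that maps to
            // several chars (e.g. a ligature) contributes just its first one.
            TextPosition textPosition = textPositionAt(index);
            String text = textPosition.getUnicode();
            return text.charAt(0);
        }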
    
        @Override
        public TextPositionSequence subSequence(int start, int end)
        {
            return new TextPositionSequence(textPositions, this.start + start, this.start + end);
        }
    
        @Override
        public String toString()
        {
            StringBuilder builder = new StringBuilder(length());
            for (int i = 0; i < length(); i++)
            {
                builder.append(charAt(i));
            }
            return builder.toString();
        }
    
        public TextPosition textPositionAt(int index)
        {
            return textPositions.get(start + index);
        }
    
        public float getX()
        {
            return textPositions.get(start).getXDirAdj();
        }
    
        public float getY()
        {
            return textPositions.get(start).getYDirAdj();
        }
    
        public float getWidth()
        {
            if (end == start)
                return 0;
            TextPosition first = textPositions.get(start);
            TextPosition last = textPositions.get(end - 1);
            return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
        }
    
        final List<TextPosition> textPositions;
        final int start, end;
    }
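
The search loop inside writeString can also be tried out without PDFBox. The following is only a minimal sketch on plain strings: the class name SubstringHits and the method findAll are made up for this illustration, and the guard against an empty search term is a precaution added in the sketch. It shows that advancing fromIndex by one keeps overlapping occurrences.

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringHits {
    /** Returns the start index of every occurrence of needle in haystack, overlapping ones included. */
    public static List<Integer> findAll(String haystack, String needle) {
        List<Integer> starts = new ArrayList<>();
        if (needle.isEmpty()) {
            return starts; // an empty needle matches at every index and would keep the loop running
        }
        int fromIndex = 0;
        int index;
        while ((index = haystack.indexOf(needle, fromIndex)) > -1) {
            starts.add(index);
            fromIndex = index + 1; // step by one char so overlapping hits are kept
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a ${var1} b ${var1}", "${var1}")); // prints [2, 12]
        System.out.println(findAll("aaaa", "aa"));                     // prints [0, 1, 2]
    }
}
```

In findSubwords the same indices are passed to subSequence, so each hit keeps the TextPosition objects of exactly the matched characters.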
    

    To merely print the positions and widths of the hits, along with each hit's final letter and its position, you can then use this:

    void printSubwords(PDDocument document, String searchTerm) throws IOException
    {
        System.out.printf("* Looking for '%s'\n", searchTerm);
        for (int page = 1; page <= document.getNumberOfPages(); page++)
        {
            List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
            for (TextPositionSequence hit : hits)
            {
                TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
                System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                        page, hit.getX(), hit.getY(), hit.getWidth(),
                        lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
            }
        }
    }
    

    For tests I created a small test file using MS Word:

    [Image: sample file with variables]

    The output of this test

    @Test
    public void testVariables() throws IOException
    {
        try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
                PDDocument document = PDDocument.load(resource);    )
        {
            System.out.println("\nVariables.pdf\n-------------\n");
            printSubwords(document, "${var1}");
            printSubwords(document, "${var 2}");
        }
    }
    

    is

    Variables.pdf
    -------------
    
    * Looking for '${var1}'
      Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
      Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
      Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
      Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
    
    * Looking for '${var 2}'
      Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
      Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
      Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
      Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
    

    I was a bit surprised that ${var 2} was found when it sits on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives words. It looks as if it receives longer parts of the line than mere words...
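
One consequence of searching inside writeString is that each call is searched on its own, so a term can only be found if it arrives within a single call. The following is a rough sketch on plain strings, with hypothetical chunks standing in for the real writeString arguments; ChunkedSearch and countHits are made-up names, and how PDFBox actually splits a given line is not modelled here.

```java
import java.util.Arrays;
import java.util.List;

public class ChunkedSearch {
    /** Counts occurrences of term, searching each chunk separately (as one writeString call would). */
    public static int countHits(List<String> chunks, String term) {
        int hits = 0;
        if (term.isEmpty()) {
            return hits; // nothing sensible to search for
        }
        for (String chunk : chunks) {
            int from = 0;
            int index;
            while ((index = chunk.indexOf(term, from)) > -1) {
                hits++;
                from = index + 1;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Term delivered in one chunk: found.
        System.out.println(countHits(Arrays.asList("see ${var 2} here"), "${var 2}"));   // prints 1
        // Same text delivered as two chunks (e.g. split by a line break): missed.
        System.out.println(countHits(Arrays.asList("see ${var", "2} here"), "${var 2}")); // prints 0
    }
}
```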

    If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
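
As an illustration of such an enhancement, here is a small sketch that adds a height next to the width. To keep it runnable without PDFBox it uses a hypothetical Glyph class in place of TextPosition: its fields x, width and height stand for getXDirAdj(), getWidthDirAdj() and getHeightDir(). The names GlyphRunBox, Glyph, width and height are assumptions made for this example, and taking the tallest glyph as the height of a hit is just one possible choice.

```java
import java.util.Arrays;
import java.util.List;

public class GlyphRunBox {
    /** Hypothetical stand-in for the TextPosition values used below. */
    public static final class Glyph {
        final float x, width, height;

        public Glyph(float x, float width, float height) {
            this.x = x;
            this.width = width;
            this.height = height;
        }
    }

    /** Distance from the left edge of the first glyph to the right edge of the last one. */
    public static float width(List<Glyph> run) {
        if (run.isEmpty()) {
            return 0;
        }
        Glyph first = run.get(0);
        Glyph last = run.get(run.size() - 1);
        return last.x + last.width - first.x;
    }

    /** Height of the tallest glyph in the run, or 0 for an empty run. */
    public static float height(List<Glyph> run) {
        float max = 0;
        for (Glyph glyph : run) {
            max = Math.max(max, glyph.height);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Glyph> run = Arrays.asList(
                new Glyph(10f, 6f, 8f), new Glyph(16f, 6f, 10f), new Glyph(22f, 6f, 8f));
        System.out.println(width(run));  // prints 18.0
        System.out.println(height(run)); // prints 10.0
    }
}
```

In TextPositionSequence itself, the equivalent would be a getHeight() method that loops from start to end over textPositions and reads getHeightDir() from each entry.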