JPedal - Highlight word at a point in a PDF


I want to implement a feature which allows the user to double-click to highlight a word in a PDF document using the JPedal library. This would be trivial to do if I could get a word's bounding rectangle and see if the MouseEvent location falls within it; the following snippet demonstrates how to highlight a region:

private void highlightText() {
    Rectangle highlightRectangle = new Rectangle(firstPoint.x, firstPoint.y,
            secondPoint.x - firstPoint.x, secondPoint.y - firstPoint.y);
    pdfDecoder.getTextLines().addHighlights(new Rectangle[]{highlightRectangle}, false, currentPage);
    pdfDecoder.repaint();
}

However, the only examples I can find in the documentation cover plain-text extraction.


Solution

  • After looking at Mark's examples I managed to get it working. There are a few quirks so I'll explain how it all works in case it helps someone else. The key method is extractTextAsWordlist, which returns a List<String> of the form {word1, w1_x1, w1_y1, w1_x2, w1_y2, word2, w2_x1, ...} when given a region to extract from. Step-by-step instructions are listed below.

    Firstly, you need to transform the MouseEvent's Component/screen coordinates to PDF page coordinates and correct for scaling:

    /**
     * Transforms Component coordinates to page coordinates, correcting for
     * scaling and panning.
     *
     * @param x Component x-coordinate
     * @param y Component y-coordinate
     * @return Point on the PDF page
     */
    private Point getPageCoordinates(int x, int y) {
        float scaling = pdfDecoder.getScaling();
        // Left margin between the component edge and the page, which is centred horizontally.
        int xOffset = (pdfDecoder.getWidth() - pdfDecoder.getPDFWidth()) / 2;
        // Page y-coordinates run upwards, so measure y from the bottom of the page.
        int yOffset = pdfDecoder.getPDFHeight();
        // viewportOffset is a field of the enclosing class holding the current pan offset.
        int correctedX = (int) ((x - xOffset + viewportOffset.x) / scaling);
        int correctedY = (int) ((yOffset - (y + viewportOffset.y)) / scaling);
        return new Point(correctedX, correctedY);
    }
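
    To sanity-check the arithmetic without a PDF loaded, here is a minimal standalone sketch of the same transform. It has no JPedal dependency: the parameter names are hypothetical stand-ins for the values read from pdfDecoder, and it assumes (as the method above appears to) that the page width and height are already scaled to screen units.

```java
import java.awt.Point;

/** Worked example of the Component-to-page transform; not part of JPedal. */
class PageCoordinateDemo {

    /**
     * Same arithmetic as getPageCoordinates, with every input passed explicitly.
     * All parameters are hypothetical stand-ins for values read from pdfDecoder.
     */
    public static Point toPageCoordinates(int x, int y, float scaling,
                                          int componentWidth, int scaledPageWidth,
                                          int scaledPageHeight, Point viewportOffset) {
        // Horizontal margin on the left of a page centred in the component.
        int xOffset = (componentWidth - scaledPageWidth) / 2;
        int pageX = (int) ((x - xOffset + viewportOffset.x) / scaling);
        // Flip the y-axis: screen y grows downwards, page y grows upwards.
        int pageY = (int) ((scaledPageHeight - (y + viewportOffset.y)) / scaling);
        return new Point(pageX, pageY);
    }

    public static void main(String[] args) {
        // A 1200x1600 (scaled) page centred in a 1400-wide component at 2x zoom, no panning.
        Point p = toPageCoordinates(300, 200, 2.0f, 1400, 1200, 1600, new Point(0, 0));
        System.out.println(p.x + "," + p.y); // prints 100,700
    }
}
```

    The click at (300, 200) sits 200 screen units right of the page's left edge and 1400 screen units above its bottom edge, which halve to (100, 700) at a scaling of 2.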
    

    Next, create a box to scan for text. I chose to make this the width of the page and +/- 20 page units vertically (this is a fairly arbitrary number), centered at the MouseEvent:

    /**
     * Scans for all the words located within a box the width of the page and
     * 40 page units high, centered at the supplied point.
     *
     * @param p Point to centre the scan box around
     * @return  A flat List of words and their coordinates within the scan box,
     *          or an empty List if no page has been decoded
     * @throws PdfException if the text extraction fails
     */
    private List<String> scanForWords(Point p) throws PdfException {
        List<String> result = Collections.emptyList();
        if (pdfDecoder.getlastPageDecoded() > 0) {
            PdfGroupingAlgorithms currentGrouping = pdfDecoder.getGroupingObject();
            PdfPageData currentPageData = pdfDecoder.getPdfPageData();
            int x1 = currentPageData.getMediaBoxX(currentPage);
            int x2 = currentPageData.getMediaBoxWidth(currentPage) + x1;
            int y1 = p.y + 20;
            int y2 = p.y - 20;
            result = currentGrouping.extractTextAsWordlist(x1, y1, x2, y2, currentPage, true, "");
        }
        return result;
    }
    

    Then I parsed this into a sequence of Rectangles:

    /**
     * Parses a String sequence of the form
     *   {word1, w1_x1, w1_y1, w1_x2, w1_y2, word2, w2_x1, ...}
     * into a sequence of Rectangles in page coordinates.
     * 
     * @param wordList Word list sequence to parse
     * @return A List of Rectangles
     */
    private List<Rectangle> parseWordBounds(List<String> wordList) {
        List<Rectangle> wordBounds = new LinkedList<Rectangle>();
        Iterator<String> wordListIterator = wordList.iterator();
        while(wordListIterator.hasNext()) {
            // sequences are: {word, x1, y1, x2, y2}  
            wordListIterator.next(); // skip the word
            int x1 = (int) Float.parseFloat(wordListIterator.next());
            int y1 = (int) Float.parseFloat(wordListIterator.next());
            int x2 = (int) Float.parseFloat(wordListIterator.next());
            int y2 = (int) Float.parseFloat(wordListIterator.next());
            wordBounds.add(new Rectangle(x1, y2, x2 - x1, y1 - y2)); // in page, not screen coordinates
        }
        return wordBounds;
    }
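
    The method above throws the word text away. If you also want to know which word was clicked, a variant along the following lines keeps the text with its bounds. This is only a sketch: WordBox, parseWordBoxes and wordAt are hypothetical names of my own, not JPedal API, and the input is assumed to be the flat five-entries-per-word list described earlier.

```java
import java.awt.Point;
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of parsing the flat word list while keeping each word's text. */
class WordBoxDemo {

    /** Hypothetical holder pairing a word with its bounds in page coordinates. */
    public static final class WordBox {
        public final String text;
        public final Rectangle bounds;

        WordBox(String text, Rectangle bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** Parses {word, x1, y1, x2, y2, ...} into WordBoxes, rejecting truncated input. */
    public static List<WordBox> parseWordBoxes(List<String> wordList) {
        if (wordList.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected 5 entries per word, got " + wordList.size());
        }
        List<WordBox> boxes = new ArrayList<WordBox>();
        for (int i = 0; i < wordList.size(); i += 5) {
            int x1 = (int) Float.parseFloat(wordList.get(i + 1));
            int y1 = (int) Float.parseFloat(wordList.get(i + 2));
            int x2 = (int) Float.parseFloat(wordList.get(i + 3));
            int y2 = (int) Float.parseFloat(wordList.get(i + 4));
            // y1 is the upper edge and y2 the lower edge, so the Rectangle starts at y2.
            boxes.add(new WordBox(wordList.get(i), new Rectangle(x1, y2, x2 - x1, y1 - y2)));
        }
        return boxes;
    }

    /** Returns the text of the first word whose bounds contain p, or null if there is none. */
    public static String wordAt(Point p, List<WordBox> boxes) {
        for (WordBox box : boxes) {
            if (box.bounds.contains(p)) {
                return box.text;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "Hello", "10.0", "120.0", "50.0", "100.0",
                "world", "60.0", "120.0", "100.0", "100.0");
        System.out.println(wordAt(new Point(70, 110), parseWordBoxes(raw))); // prints world
    }
}
```

    The size check means a truncated list fails with a clear message instead of a NoSuchElementException halfway through parsing.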
    

    Then I identified which Rectangle the MouseEvent fell within:

    /**
     * Finds the bounding Rectangle of a word located at a Point.
     * 
     * @param p Point to find word bounds
     * @param wordBounds List of word boundaries to search
     * @return A Rectangle that bounds a word and contains a point, or null if 
     *         there is no word located at the point
     */
    private Rectangle findWordBoundsAtPoint(Point p, List<Rectangle> wordBounds) {
        Rectangle result = null;
        for (Rectangle wordBound : wordBounds) {
            if (wordBound.contains(p)) {
                result = wordBound;
                break;
            }
        }
        return result;
    }
    

    For some reason, just passing this Rectangle to the highlighting method didn't work. After some tinkering, I found that shrinking the Rectangle by a point on each side resolved the problem:

    /**
     * Contracts a Rectangle to enable it to be highlighted.
     * 
     * @return A contracted Highlight Rectangle
     */
    private Rectangle contractHighlight(Rectangle highlight){
        int x = highlight.x + 1;
        int y = highlight.y + 1;
        int width = highlight.width - 2;
        int height = highlight.height - 2;
        return new Rectangle(x, y, width, height);
    }
    

    Then I just passed it to this method to add highlights:

    /**
     * Highlights a region of the current page.
     *
     * @param highlightRectangle Region to highlight, in page coordinates
     */
    private void highlightText(Rectangle highlightRectangle) {
        pdfDecoder.getTextLines().addHighlights(new Rectangle[]{highlightRectangle}, false, currentPage);
        pdfDecoder.repaint();
    }
    

    Finally, all the above calls are packed into this convenient method:

    /**
     * Highlights the word at the given point.
     *
     * @param p Point where the word is located, in page coordinates
     *          (see getPageCoordinates)
     */
    private void highlightWordAtPoint(Point p) {
        try {
            Rectangle wordBounds = findWordBoundsAtPoint(p, parseWordBounds(scanForWords(p)));
            if (wordBounds != null) {
                highlightText(contractHighlight(wordBounds));
            }
        } catch (PdfException e) {
            // Text extraction failed: report the error and leave the page unhighlighted.
            e.printStackTrace();
        }
    }
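
    The question asked for double-click behaviour, so for completeness here is one way the trigger could be wired up. It is a sketch using only standard AWT/Swing classes; doubleClickListener is a hypothetical helper of my own, and nothing in it is JPedal-specific.

```java
import java.awt.Point;
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.util.function.Consumer;
import javax.swing.JPanel;

/** Sketch of a listener that reports the location of double-clicks. */
class DoubleClickDemo {

    /** Returns a listener that forwards the Component coordinates of each double-click. */
    public static MouseAdapter doubleClickListener(final Consumer<Point> handler) {
        return new MouseAdapter() {
            @Override
            public void mouseClicked(MouseEvent e) {
                // getClickCount() is 2 for the second click of a double-click.
                if (e.getClickCount() == 2) {
                    handler.accept(e.getPoint());
                }
            }
        };
    }

    public static void main(String[] args) {
        MouseAdapter listener = doubleClickListener(
                p -> System.out.println("double-click at " + p.x + "," + p.y));
        // Simulate a double-click on a stand-in component instead of a real viewer.
        JPanel source = new JPanel();
        listener.mouseClicked(
                new MouseEvent(source, MouseEvent.MOUSE_CLICKED, 0L, 0, 40, 60, 2, false));
    }
}
```

    Assuming pdfDecoder is a Swing component, as the getWidth() and repaint() calls above suggest, the listener could be attached with pdfDecoder.addMouseListener(...) and a handler that calls highlightWordAtPoint(getPageCoordinates(p.x, p.y)).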