
PDFBox: how to determine bounding box of vector figure (path shape)


I have a simple PDF generated by Apache FOP + jEuclid. It contains vector graphics for math formulas alongside regular text.

Link to PDF: https://www.dropbox.com/s/w4ksnud78bu9oz5/test.pdf?dl=0

I would like to know the bounding box (x, y, width, height) of each vector graphic. I've tried this example: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.24/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java, but it doesn't output any useful information, only this:

Processing page: 1

In Acrobat I can select the vector images in the Tags tree, and it highlights them.

My question: how can I determine the bounding boxes of vector images via the PDFBox API?


Solution

  • As long as the figures in question are appropriately tagged (as they are in your example document), you can determine their bounding boxes with a class based on the PDFBox PDFGraphicsStreamEngine.

    You can actually make use of the BoundingBoxFinder from this answer (based on the PDFGraphicsStreamEngine), which determines the bounding box of all content of a page. You merely have to retrieve the bounding box information marked content sequence by marked content sequence.

    The following class does that by storing bounding box information in a hierarchy of MarkedContent objects:

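    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;

    // BoundingBoxFinder is the class from the answer referenced above;
    // import it from wherever you placed it.
    public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {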
        public MarkedContentBoundingBoxFinder(PDPage page) {
            super(page);
            contents.add(content);
        }
    
        @Override
        public void processPage(PDPage page) throws IOException {
            super.processPage(page);
            endMarkedContentSequence();
        }
    
        @Override
        public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
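            // The sequence that encloses the one now starting.
            MarkedContent current = contents.getLast();
            // Content drawn since the last begin/end belongs to the enclosing
            // sequence, so merge it into that sequence's bounding box.
            if (rectangle != null) {
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = rectangle;
            }
            // Start the new sequence with an empty bounding box.
            rectangle = null;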
            MarkedContent newContent = new MarkedContent(tag, properties);
            contents.addLast(newContent);
            current.children.add(newContent);
    
            super.beginMarkedContentSequence(tag, properties);
        }
    
        @Override
        public void endMarkedContentSequence() {
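            MarkedContent current = contents.removeLast();
            if (rectangle != null) {
                // Merge content drawn since the last begin/end with what this
                // sequence has collected so far.
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = (Rectangle2D) rectangle.clone();
            } else if (current.boundingBox != null)
                // Nothing new was drawn; carry this sequence's box forward
                // so that the enclosing sequence includes it.
                rectangle = (Rectangle2D) current.boundingBox.clone();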
    
            super.endMarkedContentSequence();
        }
    
        public static class MarkedContent {
            public MarkedContent(COSName tag, COSDictionary properties) {
                this.tag = tag;
                this.properties = properties;
            }
    
            public final COSName tag;
            public final COSDictionary properties;
            public final List<MarkedContent> children = new ArrayList<>();
            public Rectangle2D boundingBox = null;
        }
    
        public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
        public final Deque<MarkedContent> contents = new ArrayDeque<>();
    }
    

    (MarkedContentBoundingBoxFinder utility class)
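
    To see the begin/end bookkeeping in isolation, here is a minimal sketch that mirrors it without PDFBox. Everything in it is hypothetical and only for illustration: BoxStackDemo, Node, and the draw method (a stand-in for whatever drawing operations the real engine reports) are not part of PDFBox or of the class above, and the field running plays the role that rectangle plays in BoundingBoxFinder.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** PDFBox-free illustration of the begin/end bookkeeping; all names are hypothetical. */
final class BoxStackDemo {
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D box; // stays null until content is seen

        Node(String tag) { this.tag = tag; }
    }

    final Node root = new Node("Document");
    private final Deque<Node> open = new ArrayDeque<>();
    private Rectangle2D running; // box of content seen since the last begin/end

    BoxStackDemo() { open.add(root); }

    /** Stand-in for a drawing operation: grows the running box. */
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (running == null) running = r; else running.add(r);
    }

    void begin(String tag) {
        Node current = open.getLast();
        // Hand what was drawn so far to the enclosing node.
        if (running != null) {
            if (current.box != null) running.add(current.box);
            current.box = running;
        }
        running = null; // the new sequence starts empty
        Node child = new Node(tag);
        current.children.add(child);
        open.addLast(child);
    }

    void end() {
        Node current = open.removeLast();
        if (running != null) {
            if (current.box != null) running.add(current.box);
            current.box = (Rectangle2D) running.clone();
        } else if (current.box != null) {
            running = (Rectangle2D) current.box.clone();
        }
        // running is left set on purpose, so the parent picks it up later.
    }

    public static void main(String[] args) {
        BoxStackDemo demo = new BoxStackDemo();
        demo.begin("Figure"); demo.draw(10, 10, 20, 5); demo.end();
        demo.begin("P");      demo.draw(40, 11, 30, 3); demo.end();
        demo.end(); // closes the implicit root, like the extra call in processPage
        System.out.println(demo.root.tag + " " + demo.root.box);
        for (Node child : demo.root.children)
            System.out.println("  " + child.tag + " " + child.box);
    }
}
```

    The point of the sketch is that a closed sequence leaves its box in the running rectangle, which is how the Document entry ends up covering all of its children.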

    You can apply it to a PDPage pdPage like this:

    MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    MarkedContent markedContent = boxFinder.content;
    

    (excerpt from DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)

    You can output the bounding boxes from that markedContent object like this:

    void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
        StringBuilder builder = new StringBuilder();
        builder.append(prefix).append(markedContent.tag.getName());
        builder.append(' ').append(markedContent.boundingBox);
        System.out.println(builder.toString());
        for (MarkedContent child : markedContent.children)
            printMarkedContentBoundingBoxes(child, prefix + "  ");
    }
    

    (DetermineBoundingBox helper method)

    For your example document, this prints:

    Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
      Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
      P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
      Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
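
    These rectangles are in PDF coordinates, where by default the origin is the lower-left corner and y grows upwards, so x and y denote the lower-left corner of each box. If you need (x, y, width, height) relative to the top-left corner instead, a conversion could look like the following sketch. It assumes an unrotated page whose visible area starts at (0, 0); TopLeftBox and toTopLeft are hypothetical names, and the page height of 800 is a made-up value (in PDFBox you would typically take it from pdPage.getCropBox().getHeight()).

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: flips a box from bottom-left-origin to top-left-origin coordinates. */
final class TopLeftBox {
    /**
     * Assumes an unrotated page whose lower-left corner is at (0, 0).
     * The new y is the distance from the top edge of the page to the top edge of the box.
     */
    static Rectangle2D toTopLeft(Rectangle2D pdfBox, double pageHeight) {
        return new Rectangle2D.Double(
                pdfBox.getX(),
                pageHeight - pdfBox.getMaxY(),
                pdfBox.getWidth(),
                pdfBox.getHeight());
    }

    public static void main(String[] args) {
        // Made-up box and page height, chosen only to keep the arithmetic simple.
        Rectangle2D figure = new Rectangle2D.Double(10, 700, 40, 20);
        System.out.println(toTopLeft(figure, 800));
    }
}
```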
    

    Similarly, you can draw the bounding boxes onto the PDF using the drawMarkedContentBoundingBoxes methods of DetermineBoundingBox.