I have a simple PDF generated by Apache FOP + jEuclid. The PDF contains vector graphics for math formulas as well as text.
Link to PDF: https://www.dropbox.com/s/w4ksnud78bu9oz5/test.pdf?dl=0
I would like to know the bounding box (x, y, width, height) of each vector graphic. I've tried this example: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.24/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java, but it doesn't output any information, only this:
Processing page: 1
In Acrobat, I can select the vector images in the Tags tree and it highlights them.
My question: how can I determine the bounding boxes of vector images via the PDFBox API?
As long as the figures in question are appropriately tagged (as they are in your example document), you can determine their bounding boxes based on the PDFBox PDFGraphicsStreamEngine.
You can actually make use of the BoundingBoxFinder from this answer (based on the PDFGraphicsStreamEngine), which determines the bounding box of all content of a page; you merely have to retrieve the bounding box information marked content sequence by marked content sequence.
The following class does that by storing bounding box information in a hierarchy of MarkedContent objects:
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
// BoundingBoxFinder is the class from the linked answer; import it from wherever you placed it.

public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {
    public MarkedContentBoundingBoxFinder(PDPage page) {
        super(page);
        // The artificial root entry collects everything on the page.
        contents.add(content);
    }

    @Override
    public void processPage(PDPage page) throws IOException {
        super.processPage(page);
        // Close the artificial root entry so it receives the box of the whole page.
        endMarkedContentSequence();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
        // Hand the box collected so far to the enclosing entry, then start afresh.
        MarkedContent current = contents.getLast();
        if (rectangle != null) {
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = rectangle;
        }
        rectangle = null;
        MarkedContent newContent = new MarkedContent(tag, properties);
        contents.addLast(newContent);
        current.children.add(newContent);
        super.beginMarkedContentSequence(tag, properties);
    }

    @Override
    public void endMarkedContentSequence() {
        // Store the box of the finished sequence; the enclosing entry continues with a copy of it.
        MarkedContent current = contents.removeLast();
        if (rectangle != null) {
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = (Rectangle2D) rectangle.clone();
        } else if (current.boundingBox != null)
            rectangle = (Rectangle2D) current.boundingBox.clone();
        super.endMarkedContentSequence();
    }

    public static class MarkedContent {
        public MarkedContent(COSName tag, COSDictionary properties) {
            this.tag = tag;
            this.properties = properties;
        }

        public final COSName tag;
        public final COSDictionary properties;
        public final List<MarkedContent> children = new ArrayList<>();
        public Rectangle2D boundingBox = null;
    }

    public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
    public final Deque<MarkedContent> contents = new ArrayDeque<>();
}
(MarkedContentBoundingBoxFinder utility class)
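To see the bookkeeping of the class above in isolation, here is a minimal sketch that replays the same begin/end logic without PDFBox. Everything in it is a hypothetical stand-in: `NestedBoxSketch`, `Node`, `draw`, `begin` and `end` are illustration names, and `draw` merely imitates the engine reporting a painted shape. It is not the code of the linked BoundingBoxFinder.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedBoxSketch {
    // Hypothetical stand-in for MarkedContent: a tag, a box, and child nodes.
    static class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D box;

        Node(String tag) {
            this.tag = tag;
        }
    }

    final Node root = new Node("Document");
    private final Deque<Node> stack = new ArrayDeque<>();
    // Box of everything seen since the last begin/end event, or null if nothing was seen.
    private Rectangle2D running;

    NestedBoxSketch() {
        stack.add(root);
    }

    // Stand-in for the graphics engine reporting a painted shape.
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (running == null)
            running = r;
        else
            running.add(r); // Rectangle2D.add forms the union
    }

    // Merge the running box into the given node's box.
    private void flushInto(Node node) {
        if (running != null) {
            if (node.box != null)
                running.add(node.box);
            node.box = (Rectangle2D) running.clone();
        }
    }

    void begin(String tag) {
        Node current = stack.getLast();
        flushInto(current);
        running = null; // the new sequence starts with an empty box
        Node child = new Node(tag);
        current.children.add(child);
        stack.addLast(child);
    }

    void end() {
        Node current = stack.removeLast();
        flushInto(current);
        // The enclosing node continues with a copy of what the finished one covered.
        running = current.box == null ? null : (Rectangle2D) current.box.clone();
    }

    public static void main(String[] args) {
        NestedBoxSketch sketch = new NestedBoxSketch();
        sketch.begin("Figure");
        sketch.draw(0, 0, 10, 5);
        sketch.end();
        sketch.begin("P");
        sketch.draw(12, 1, 8, 3);
        sketch.end();
        sketch.end(); // closes the artificial root, like processPage does
        System.out.println(sketch.root.tag + " " + sketch.root.box);
        for (Node child : sketch.root.children)
            System.out.println(" " + child.tag + " " + child.box);
    }
}
```

Each child keeps only the box of its own content, while the root ends up with the union of all of them, which is the same relationship the Document and Figure/P entries show for the real page.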
You can apply it to a PDPage pdPage like this:
MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
boxFinder.processPage(pdPage);
MarkedContent markedContent = boxFinder.content;
(excerpt from DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)
You can output the bounding boxes from that markedContent object like this:
void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
    StringBuilder builder = new StringBuilder();
    builder.append(prefix).append(markedContent.tag.getName());
    builder.append(' ').append(markedContent.boundingBox);
    System.out.println(builder.toString());
    for (MarkedContent child : markedContent.children)
        printMarkedContentBoundingBoxes(child, prefix + " ");
}
(DetermineBoundingBox helper method)
For your example document you get:
Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
 Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
 P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
 Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
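The y values above look like PDF user space coordinates, where y grows upward from the bottom of the page (hence y ≈ 758 for a line near the top). If you need (x, y, width, height) relative to the top-left corner instead, a conversion along the following lines may help. This is only a sketch under assumptions: `TopLeftBoxSketch` and `toTopLeft` are hypothetical names, the page is assumed to be unrotated with its lower-left corner at the origin, and the page height in `main` is a placeholder rather than a value read from your document.

```java
import java.awt.geom.Rectangle2D;

class TopLeftBoxSketch {
    // Flip a bottom-left-origin box into a top-left-origin box.
    // Assumes an unrotated page whose lower-left corner is at (0, 0).
    static Rectangle2D toTopLeft(Rectangle2D box, double pageHeight) {
        return new Rectangle2D.Double(
                box.getX(),
                pageHeight - box.getMaxY(), // distance from the top edge to the top of the box
                box.getWidth(),
                box.getHeight());
    }

    public static void main(String[] args) {
        // Placeholder height; with PDFBox you would typically take it from
        // pdPage.getMediaBox().getHeight().
        double pageHeight = 800;
        Rectangle2D box = new Rectangle2D.Double(10, 700, 50, 20);
        System.out.println(toTopLeft(box, pageHeight));
    }
}
```

Width and height are unaffected by the flip; only y changes, from the bottom edge of the box measured upward to the top edge of the box measured downward.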
Similarly, you can draw the bounding boxes on the PDF using the drawMarkedContentBoundingBoxes methods of DetermineBoundingBox. For your example document you get: