I found some pretty good solutions in this forum for extracting images from PDF documents using PDFBox. I used the following code snippet, which I found in one post:
PDPageTree list = document.getPages();
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        try {
            PDXObject imageObj = pdResources.getXObject(c);
            if (imageObj instanceof PDImageXObject) {
                // save image to list
                BufferedImage bImage = ((PDImageXObject) imageObj).getImage();
                acceptedImages.add(bImage);
            }
        } catch (MissingImageReaderException mex) {
            log.warn("Missing Image Reader for format: ", mex);
        }
    }
}
However, in rare cases some of the extracted images have the wrong orientation. When I look at the PDF document, the pictures are displayed correctly, but some of the extracted images are rotated by a multiple of 90°. I guess the rotation information is stored somewhere in the PDF?
Run the PrintImageLocations.java example from the source code download and analyse the CTM ("current transformation matrix") to extract the rotation with Math.round(Math.toDegrees(Math.atan2(ctmNew.getShearY(), ctmNew.getScaleY()))).
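To illustrate what that formula computes, here is a minimal sketch that applies it to a java.awt.geom.AffineTransform instead of the PDFBox matrix, so it runs without PDFBox on the classpath. The class name CtmRotation, the helper rotationDegrees and the sample matrix values are assumptions made up for this example. The normalization to 0..359 is also an addition, since atan2 reports a 270° rotation as -90.

```java
import java.awt.geom.AffineTransform;

public class CtmRotation {

    // Hypothetical helper: same formula as in the answer, applied to an
    // AffineTransform used here as a stand-in for the CTM.
    // The result is normalized to the range 0..359 degrees.
    static int rotationDegrees(AffineTransform ctm) {
        long degrees = Math.round(
                Math.toDegrees(Math.atan2(ctm.getShearY(), ctm.getScaleY())));
        return (int) Math.floorMod(degrees, 360L);
    }

    public static void main(String[] args) {
        // Made-up CTM of a scaled image rotated by 90 degrees and moved to (50, 60).
        // Constructor argument order: scaleX, shearY, shearX, scaleY, translateX, translateY.
        AffineTransform ctm = new AffineTransform(0, 200, -100, 0, 50, 60);
        System.out.println(rotationDegrees(ctm)); // prints 90
    }
}
```

With PDFBox you would pass the values from the CTM that PrintImageLocations prints for each image. Which direction you then have to rotate the BufferedImage to correct it is something to verify against a sample document, because PDF space has its y axis pointing up while image space has it pointing down.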