Images inverted and split when extracting images from a PDF document using PDFBox or Poppler

I want to extract the whole images on each page of a PDF document using PDFBox in Java, but all extracted images are inverted and split. Note that this is not a bug in PDFBox or Poppler; it is caused by the way the PDF document itself is built. How can I piece the whole images together and get the correct orientation for each of them? Any advice is welcome; a snippet of Java code would be preferred. My PDF link: download

Solution

At first glance it looked like each of the figures in question was drawn in a separate block of content stream instructions enveloped by, but not containing, text objects. Thus, one approach to isolating them is to export each such block of instructions to a separate new page. You can then post-process these new pages, e.g. by rendering them as bitmap images using a PDFRenderer.
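
To make the splitting idea concrete, here is a minimal sketch in plain Java that does not use PDFBox at all. It treats a content stream as a list of operator names and cuts it at text objects (BT ... ET). The class name FigureBlockSketch and the method splitAtTextObjects are hypothetical names chosen for illustration, and operands are ignored entirely, so this only shows the principle, not a working extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: operators are plain strings, operands are ignored.
class FigureBlockSketch {

    /** Returns the runs of operators that lie outside BT ... ET text objects. */
    static List<List<String>> splitAtTextObjects(List<String> operators) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean inText = false;
        for (String op : operators) {
            if (op.equals("BT")) {
                // A text object starts: close the block collected so far.
                if (!current.isEmpty()) {
                    blocks.add(current);
                }
                current = new ArrayList<>();
                inText = true;
            } else if (op.equals("ET")) {
                inText = false;
            } else if (!inText) {
                current.add(op);
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
                "BT", "Tf", "Tj", "ET",   // heading text
                "q", "cm", "Do", "Q",     // figure drawn as an XObject
                "BT", "Tj", "ET",         // caption text
                "q", "re", "f", "Q");     // a filled rectangle
        for (List<String> block : splitAtTextObjects(stream)) {
            System.out.println(block);
        }
    }
}
```

Each returned block corresponds to what would be written to its own new page.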

I based the code for this on the PdfContentStreamEditor, originally from this answer, like this:

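// Imports assume PDFBox 2.x; PdfContentStreamEditor is the helper class from the
// referenced answer, not part of PDFBox itself.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

PDDocument document = PDDocument.load(...);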

for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        ByteArrayOutputStream commonRaw = null;
        ContentStreamWriter commonWriter = null;
        int depth = 0;

        @Override
        public void processPage(PDPage page) throws IOException {
            commonRaw = new ByteArrayOutputStream();
            try {
                commonWriter = new ContentStreamWriter(commonRaw);
                startFigurePage(page);
                super.processPage(page);
            } finally {
                endFigurePage();
                commonRaw.close();
            }
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();
            if (operatorString.equals("BT")) {
                endFigurePage();
            }
            if (operatorString.equals("q")) {
                depth++;
            }
            writeFigure(operator, operands);
            if (operatorString.equals("Q")) {
                depth--;
            }
            if (operatorString.equals("ET")) {
                startFigurePage(getCurrentPage());
            }

            super.write(contentStreamWriter, operator, operands);
        }

        OutputStream figureRaw = null;
        ContentStreamWriter figureWriter = null;
        PDPage figurePage = null;
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        void startFigurePage(PDPage currentPage) throws IOException {
            figurePage = new PDPage(currentPage.getMediaBox());
            figurePage.setResources(currentPage.getResources());
            PDStream stream = new PDStream(document);
            figurePage.setContents(stream);
            figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
            figureRaw.write(commonRaw.toByteArray());
            xobjectsDrawn = 0;
            pathsPainted = 0;
        }

        void endFigurePage() throws IOException {
            if (figureWriter != null) {
                figureWriter = null;
                figureRaw.close();
                figureRaw = null;
                // Keep the page only if it draws an XObject or paints more than three paths.
                if (xobjectsDrawn > 0 || pathsPainted > 3)
                    document.addPage(figurePage);
                figurePage = null;
            }
        }

        final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                "B", "B*", "b", "b*");

        void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
            if (figureWriter != null) {
                String operatorString = operator.getName();
                boolean isXObjectDo = operatorString.equals("Do");
                boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                if (isXObjectDo)
                    xobjectsDrawn++;
                if (isPathPainting)
                    pathsPainted++;
                figureWriter.writeTokens(operands);
                figureWriter.writeToken(operator);
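                if (depth == 0) {
                    // Outside any q ... Q block: also record the instruction in the common
                    // prefix replayed on later figure pages, but without drawing XObjects
                    // or painting paths there.
                    if (!isXObjectDo) {
                        if (isPathPainting)
                            operator = Operator.getOperator("n");
                        commonWriter.writeTokens(operands);
                        commonWriter.writeToken(operator);
                    }
                }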
            }
        }
    };
    editor.processPage(page);
}

document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));

(IsolateFigures test testIsolateInMy)
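
The role of commonRaw in the code above may not be obvious, so here is a plain-Java sketch of how that common prefix appears to be built: instructions outside any q ... Q block are remembered and replayed at the start of each later figure page, with XObject drawing dropped and path painting replaced by the no-op n. The class CommonPrefixSketch and the method commonPrefix are hypothetical names, operands are ignored, and the input is assumed to contain no text objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: operators are plain strings, operands are ignored.
class CommonPrefixSketch {

    static final List<String> PATH_PAINTING_OPERATORS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "B*", "b", "b*");

    /** Collects the top-level operators that a later figure page would replay first. */
    static List<String> commonPrefix(List<String> operators) {
        List<String> common = new ArrayList<>();
        int depth = 0;
        for (String op : operators) {
            if (op.equals("q")) {
                depth++;
            }
            if (depth == 0 && !op.equals("Do")) {
                // Keep state changes and path construction, but end paths without painting.
                common.add(PATH_PAINTING_OPERATORS.contains(op) ? "n" : op);
            }
            if (op.equals("Q")) {
                depth--;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        System.out.println(commonPrefix(
                Arrays.asList("w", "q", "cm", "Do", "Q", "re", "f", "Do")));
    }
}
```

The point of the design seems to be that graphics state set at the top level (line width, colors, clip paths) still applies on a later figure page, while nothing drawn earlier shows up there again.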

The first figures are extracted quite well:

[Images: S30 a, S30 b, S31 a, S31 b]

Certain figures, though, turn out to contain text objects and are therefore split into partial images and lose their text content:

[Images: S32 b 1, S32 b 2, S32 b 3, S32 b 4]