Remove image from PDF using PDFBox


I'd like to remove certain images from pages using the PDFBox library. As far as I know, the most appropriate way to identify an image is by its name in the XObject dictionary. So in theory, all that needs to be done is to remove the image from the XObject dictionary along with the operators that display it. This is how I approached removing an image from a page: using PDFBox Debugger, I found the image's name (Im3) and the instructions that display it on the page:

[Screenshot: the page's content stream in PDFBox Debugger, showing the instructions around /Im3 Do]

My question is: which instructions before and after the one that displays the image should I remove (Do is the operator that displays the image)? I decided to remove all from (thus including) /Gs1 gs up to /Im3 Do. Is this the correct approach? Here's my code. It works correctly, i.e. the unwanted image no longer appears on any page. I use the latest version (v3) of PDFBox.

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import static org.apache.pdfbox.contentstream.operator.OperatorName.DRAW_OBJECT;
import static org.apache.pdfbox.contentstream.operator.OperatorName.SET_GRAPHICS_STATE_PARAMS;


public class GraphicRemover {

    private static final COSName X_OBJECT_NAME_TO_REMOVE = COSName.getPDFName("Im3");

    public void remove(final PDDocument document) throws IOException {

        for (final PDPage page : document.getPages()) {

            final PDFStreamParser parser = new PDFStreamParser(page);
            final List<Object> tokens = parser.parse();

            final boolean hasImage = tokens.stream().anyMatch(X_OBJECT_NAME_TO_REMOVE::equals);

            if (!hasImage) {
                continue;
            }

            for (int i = tokens.size() - 1; i >= 0; i--) {

                if (!tokens.get(i).equals(X_OBJECT_NAME_TO_REMOVE)) {
                    continue;
                }

                int indexOfGraphicStartCommand = i - 1;
                while (!Operator.getOperator(SET_GRAPHICS_STATE_PARAMS).equals(tokens.get(indexOfGraphicStartCommand))) {
                    --indexOfGraphicStartCommand;
                }

                int indexOfDisplayGraphicCommand = i;
                while (!Operator.getOperator(DRAW_OBJECT).equals(tokens.get(indexOfDisplayGraphicCommand))) {
                    ++indexOfDisplayGraphicCommand;
                }
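                // Step back once more so the range also covers the operand of gs (e.g. /Gs1).
                final int indexOfGraphicStartArgument = --indexOfGraphicStartCommand;
                // subList's end index is exclusive, so add 1 to remove the Do operator itself as well.
                tokens.subList(indexOfGraphicStartArgument, indexOfDisplayGraphicCommand + 1).clear();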
                final PDStream newContents = new PDStream(document);
                // try-with-resources closes the stream even if writing the tokens fails
                try (OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE)) {
                    new ContentStreamWriter(newContentOutput).writeTokens(tokens);
                }
                page.setContents(newContents);
                removeWatermarkObject(page);
                break;
            }
        }
    }

    private void removeWatermarkObject(final PDPage page) {
        ((COSDictionary) page.getResources().getCOSObject()
                .getDictionaryObject(COSName.XOBJECT)).removeItem(X_OBJECT_NAME_TO_REMOVE);
        removeEmptyXobjects(page);
    }

    private void removeEmptyXobjects(final PDPage page) {
        final COSDictionary xObjects =
                (COSDictionary) page.getResources().getCOSObject().getDictionaryObject(COSName.XOBJECT);

        if (xObjects == null || xObjects.size() == 0) {
            page.getResources().getCOSObject().removeItem(COSName.XOBJECT);
        }
    }
} 

Solution

  • I decided to remove all from (thus including) /Gs1 gs up to /Im3 Do. Is this the correct approach?

    If you include the opening q, you also have to include the closing Q: they mean save-graphics-state (on a stack) and restore-graphics-state (from that stack), so only removing the opening q can have a severe impact on the following content.

    Alternatively it would also be ok to only remove the /Im3 Do.

    In a comment you followed up with:

    So the output should be as follows:

    https://imgur.com/a/YW1EkAZ

    That would be a valid option.

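    But if you really want to remove more than just the /Im3 Do, you can also remove the remaining q /Perceptual ri Q: that sequence only saves the graphics state, sets a rendering intent, and restores the state again, so it is essentially a no-op.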
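
To make the two options above concrete, here is a minimal sketch of both checks on a token list. It deliberately uses plain strings as stand-ins for the COSName and Operator tokens that PDFStreamParser returns, so it runs without PDFBox. The class name, the method names, and the sample token sequence are assumptions made for illustration (the sequence is only a guess at what the screenshot showed), not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamTokenSketch {

    /** Returns a copy of the tokens without any "<name> Do" pair that paints the given XObject. */
    public static List<String> removeDrawCalls(List<String> tokens, String xObjectName) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            boolean isOperandOfDo = tokens.get(i).equals(xObjectName)
                    && i + 1 < tokens.size()
                    && tokens.get(i + 1).equals("Do");
            if (isOperandOfDo) {
                i++; // skip the Do operator together with its operand
                continue;
            }
            result.add(tokens.get(i));
        }
        return result;
    }

    /**
     * Checks whether the range [fromInclusive, toExclusive) can be cut out without
     * unbalancing the graphics state stack: every q in the range needs its Q in the
     * range, and no Q in the range may belong to a q outside of it.
     */
    public static boolean isBalanced(List<String> tokens, int fromInclusive, int toExclusive) {
        int depth = 0;
        for (String token : tokens.subList(fromInclusive, toExclusive)) {
            if (token.equals("q")) {
                depth++;
            } else if (token.equals("Q")) {
                depth--;
                if (depth < 0) {
                    return false; // restores a state that was saved before the range
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        // Assumed sample sequence, not taken from the actual PDF.
        List<String> tokens = List.of("/Gs1", "gs", "q", "/Perceptual", "ri", "/Im3", "Do", "Q");

        // Option 1: drop only the draw call.
        System.out.println(removeDrawCalls(tokens, "/Im3")); // [/Gs1, gs, q, /Perceptual, ri, Q]

        // Option 2: before cutting a larger range, verify that q and Q stay paired.
        System.out.println(isBalanced(tokens, 0, 7)); // false: /Gs1 ... Do contains q but not its Q
        System.out.println(isBalanced(tokens, 2, 8)); // true:  q ... Q is a complete pair
    }
}
```

With real PDFBox tokens the same comparisons would be made against Operator and COSName instances rather than strings, which also avoids confusing an operator with a string operand that happens to read "q" or "Do".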