
Detecting rotated/inverted images with PDFBox


I've got some PDFs which have images that are rotated and inverted. Sample of the image:

[Image: inverted and rotated text]

The page object itself has no rotation instructions, but the content stream appears to use the cm operator to apply the transformation when the PDF is rendered:

...snip...
q
  0.05 0 0 -0.05 0 768 cm
  q
    0 0 11880 15360 re
    W*
    n
    /GS0 gs
    1 J
    [ ] 0 d
    2 w
    0 0 0 RG
    /GS0 gs
    1 1 1 rg
    /GS0 gs
    1 1 1 rg
    /GS0 gs
    1 J
    [ ] 0 d
    2 w
    0 0 0 RG
    q
      11865 0 0 15360 0 -3 cm
      /Image1 Do
    Q
...snip...    

Am I on the right track here?
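
As a sanity check on that reading, the two cm matrices in the snippet can be multiplied out by hand. The following is a minimal sketch that uses only java.awt.geom.AffineTransform from the JDK rather than PDFBox. The class CmFlipCheck and its methods are hypothetical names for illustration, and it assumes no other cm operators sit in the elided parts of the stream.

```java
import java.awt.geom.AffineTransform;

// Hypothetical helper, not part of PDFBox: multiplies out nested cm operands.
class CmFlipCheck {

    // Builds the effective matrix for a sequence of cm operand sets, outermost first.
    // Each entry holds the six numbers {a, b, c, d, e, f} as written in the content stream.
    static AffineTransform compose(double[]... cmOperands) {
        AffineTransform ctm = new AffineTransform(); // identity
        for (double[] m : cmOperands) {
            // concatenate() applies the new matrix before the existing one,
            // which matches how cm premultiplies the current transformation matrix.
            ctm.concatenate(new AffineTransform(m));
        }
        return ctm;
    }

    // A negative determinant means the content is mirrored, not just rotated or scaled.
    static boolean isMirrored(AffineTransform ctm) {
        return ctm.getDeterminant() < 0;
    }

    // Angle of the transformed x axis in degrees, from the a and b entries.
    static double rotationDegrees(AffineTransform ctm) {
        return Math.toDegrees(Math.atan2(ctm.getShearY(), ctm.getScaleX()));
    }

    public static void main(String[] args) {
        AffineTransform ctm = compose(
                new double[] {0.05, 0, 0, -0.05, 0, 768},   // outer cm from the snippet
                new double[] {11865, 0, 0, 15360, 0, -3});  // cm just before /Image1 Do
        System.out.println("mirrored: " + isMirrored(ctm));
        System.out.println("rotation: " + rotationDegrees(ctm));
        System.out.println("y scale:  " + ctm.getScaleY());
    }
}
```

For these operands the product works out to approximately [593.25 0 0 -768 0 768.15]. The x scale is positive and the y scale is negative, so the determinant is negative and the image is drawn flipped vertically, with no rotation of the x axis.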

We're already using PDFStreamEngine to analyze images, so I thought the current graphics state would expose the transformation matrix:

protected class DrawObjectCounter extends OperatorProcessor {
   @Override
   public void process(Operator operator, List<COSBase> operands) throws IOException {
       System.out.println(getGraphicsState().getCurrentTransformationMatrix());
       ...snip...
   }
...snip...   

The output is always:

[1.0,0.0,0.0,1.0,0.0,0.0]

Do I need to keep track of the cm operators with another OperatorProcessor, or am I just not looking in the right place?


Solution

  • The problem was in how I constructed my PDFStreamEngine. I needed to register the operator processors that save, modify, and restore the transformation matrix:

    public PdfImageStreamEngine() {
        // PDFBox operators that maintain the graphics state: q (Save), cm (Concatenate), Q (Restore)
        addOperator(new Save());
        addOperator(new Concatenate());
        addOperator(new Restore());
        // custom operators
        addOperator(new DrawObjectCounter());
        addOperator(new BeginInlineImageCounter());
        addOperator(new LineToCounter());
    }
    

    Once I registered these operator processors, getCurrentTransformationMatrix() returned the correct value.
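
    To illustrate why the matrix stayed at identity, here is a minimal sketch of the q / cm / Q bookkeeping that those three processors take care of. It is a stand-in written against the JDK only, not PDFBox code: CtmStackSketch, process, and matrixAtImage are hypothetical names, and the handleMatrixOps flag is an assumption used to mimic an engine that has no processors registered for these operators.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical stand-in for a stream engine's graphics state; not PDFBox code.
class CtmStackSketch {
    private final Deque<AffineTransform> saved = new ArrayDeque<>();
    private AffineTransform ctm = new AffineTransform(); // starts as identity
    private final boolean handleMatrixOps;

    CtmStackSketch(boolean handleMatrixOps) {
        this.handleMatrixOps = handleMatrixOps;
    }

    // Feeds one operator to the sketch; operands are only used by cm.
    void process(String operator, double... operands) {
        if (!handleMatrixOps) {
            return; // nothing handles q, cm, or Q, so the matrix never changes
        }
        switch (operator) {
            case "q":  saved.push(new AffineTransform(ctm)); break;           // save a copy
            case "cm": ctm.concatenate(new AffineTransform(operands)); break; // premultiply
            case "Q":  if (!saved.isEmpty()) ctm = saved.pop(); break;        // restore
            default:   break; // all other operators leave the matrix alone
        }
    }

    // Returns the six entries {a, b, c, d, e, f} of the current matrix.
    double[] currentMatrix() {
        double[] m = new double[6];
        ctm.getMatrix(m);
        return m;
    }

    // Replays the q and cm operators from the question's stream up to "/Image1 Do".
    static double[] matrixAtImage(boolean handleMatrixOps) {
        CtmStackSketch gs = new CtmStackSketch(handleMatrixOps);
        gs.process("q");
        gs.process("cm", 0.05, 0, 0, -0.05, 0, 768);
        gs.process("q");
        gs.process("q");
        gs.process("cm", 11865, 0, 0, 15360, 0, -3);
        return gs.currentMatrix(); // what a Do handler would see at this point
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(matrixAtImage(false))); // stays at identity
        System.out.println(Arrays.toString(matrixAtImage(true)));  // combined cm matrices
    }
}
```

    With the flag off, the matrix seen at the image is still the identity, which matches the [1.0,0.0,0.0,1.0,0.0,0.0] output in the question. With it on, the two cm matrices are combined, and each Q puts back the matrix saved by the matching q.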