
PDFBox 2.0: invisible lines on rotated page - clip path issue


File example: click here

Using the great solution from this topic, I am trying to extract visible text. The attached document has very small text, which may be causing this clip path problem where parts of letters could be hidden. For such rotated text I changed the code from the linked issue a bit:

    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();                            

        Vector center = getTextPositionCenterPoint(text);
        Area area = gs.getCurrentClippingPath();
        if (area == null || area.contains(lowerLeftX + center.getX(), lowerLeftY + center.getY())) {            
            nonStrokingColors.put(text, gs.getNonStrokingColor());
            renderingModes.put(text, gs.getTextState().getRenderingMode());
            super.processTextPosition(text);
        }
    }


    private Vector getTextPositionCenterPoint(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector center = null;
        switch (rotation) {
        case 0:
            center = new Vector(start.getX() + text.getWidth()/2, start.getY()); 
            break;
        case 90:
            center = new Vector(start.getX(), start.getY() + text.getWidth()/2);
            break;
        case 180:
            center = new Vector(start.getX() - text.getWidth()/2, start.getY());
            break;
        case 270:
            center = new Vector(start.getX(), start.getY() - text.getWidth()/2);
            break;
        default:
            center = new Vector(start.getX() + text.getWidth()/2, start.getY());
            break;
        }

        return center;
    }

What I'm trying to do is get the X-center point of each character depending on the rotation. (I'm aware that this sometimes does not work because of the text direction; however, that does not appear to be the case here.) But after applying this solution, the 2nd, 3rd, and some other rows at the bottom are skipped because of the clip path. I'm wondering where my mistake is. Thanks in advance!


Solution

  • Problems with your PDF are caused by a combination of

    • text coordinates being exactly on the clip path border;
    • different calculation paths for text coordinates and clip path coordinates with different floating point errors resulting in text coordinates on clip path borders sometimes being calculated as outside the clip path.

    Your attempt to change this unfortunately does not help here: the problem texts have their baseline coinciding with the clip path border, and your getTextPositionCenterPoint only centers along the baseline, so the centered point has issues exactly if the glyph origin has issues.

    A different workaround works better: a fat point comparison. Instead of checking whether a given point x, y is in the clip area, we check whether a small rectangle around those coordinates intersects the clip area. If coordinates wander out of the clip area due to floating point errors, this suffices to find them in the clip area nonetheless.

    To do this, we replace the area.contains(x, y) checks in processTextPosition with contains(area, x, y), which is implemented as

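    protected boolean contains(Area area, float x, float y) {
        // Relative tolerance: the rectangle spans about 0.9999 to 1.0001 times each coordinate.
        double length = .0002;
        double up = 1.0001;
        double down = .9999;
        // Start at the smaller corner in each dimension and extend by |coordinate| * length.
        return area.intersects(x < 0 ? x*up : x*down, y < 0 ? y*up : y*down, Math.abs(x*length), Math.abs(y*length));
    }
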

    (PDFVisibleTextStripper helper method)

    (Actually, the choice of the rectangle around the coordinates here is somewhat arbitrary; it simply worked for me.)

    With this change I get your missing 2nd, 3rd, and some other rows at the bottom; cf. the test ExtractVisibleText.testFat1.
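
To see the effect of the fat point comparison in isolation, here is a minimal sketch that needs only java.awt.geom and no PDFBox. The class name FatPointDemo, the method name fatContains, and the clip rectangle and coordinates are all made up for illustration; they are not taken from the example PDF. The rectangle construction is copied from the contains(area, x, y) helper above.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

class FatPointDemo {
    // Same rectangle construction as the contains(area, x, y) helper in the answer.
    static boolean fatContains(Area area, float x, float y) {
        double length = .0002;
        double up = 1.0001;
        double down = .9999;
        return area.intersects(x < 0 ? x * up : x * down, y < 0 ? y * up : y * down,
                Math.abs(x * length), Math.abs(y * length));
    }

    public static void main(String[] args) {
        // Hypothetical clip area: x from 100 to 500, y from 200 to 600.
        Area clip = new Area(new Rectangle2D.Double(100, 200, 400, 400));

        // A baseline that should sit on the lower border (y = 200)
        // but ended up marginally below it due to rounding.
        float x = 300f;
        float y = 199.9999f;

        System.out.println("exact point test:    " + clip.contains(x, y));
        System.out.println("fat point test:      " + fatContains(clip, x, y));

        // A point far below the border is still rejected by the fat test.
        System.out.println("far outside (y=150): " + fatContains(clip, x, 150f));
    }
}
```

In this sketch the exact test rejects the slightly displaced point while the fat test accepts it, and a point well outside the clip area is rejected by both. Note that the rectangle scales with the coordinates, so the tolerance is larger far from the origin and shrinks towards zero near it; a fixed absolute tolerance would be a possible alternative if that matters for a given document.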