
PDFClown: Creating a TextMarkup leads to an inaccurate Box of the TextMarkup


I'm working with PDF Clown to analyze and process PDF documents. My aim is to highlight all numbers within a table. For all numbers that belong together (for example, all numbers in one column of a table), I create one TextMarkup with a list of Quads. At first glance everything seems to work well: all highlights on the left belong to one TextMarkup, and all highlights on the right belong to another TextMarkup.

[Image: HighlightedText]

But when I analyze the size of the TextMarkup, its box is bigger than it looks in the picture. For example, when I draw a rectangle around the left TextMarkup's box, the rectangle intersects the other column, even though no highlight of the left TextMarkup intersects that column. Is there a way to optimize the box of the TextMarkup? I suspect the box has a bulbous end that makes it intersect the other TextMarkup.

This is the code which creates the TextMarkup:

// Collect one quad per text markup that belongs to this annotation.
List<Quad> highlightQuads = new ArrayList<Quad>();
for (TextMarkup textMarkup : textMarkupsForOneAnnotation) {
    Rectangle2D textBox = textMarkup.getBox();
    Rectangle2D.Double rectangle = new Rectangle2D.Double(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
    highlightQuads.add(Quad.get(rectangle));
}

// Create a single highlight annotation covering all collected quads.
if (!highlightQuads.isEmpty()) {
    TextMarkup _textMarkup = new TextMarkup(pagesOfNewFile.get(lastFoundNewFilePage).getPage(), highlightQuads, "", MarkupTypeEnum.Highlight);
    _textMarkup.setColor(DeviceRGBColor.get(Color.GREEN));
    _textMarkup.setVisible(true);
    allTextMarkUps.add(_textMarkup);
}

Here is an example file: Example

Thank You !!


Solution

  • Your code is not really self-contained (in particular, the input data is missing, so I cannot run it as is), so I could only do a bit of PDF Clown code analysis. That code analysis, though, did indeed turn up a PDF Clown implementation detail that would explain your observation.

    How does PDF Clown calculate the dimensions of the markup annotation?

    The markup annotation rectangle must be big enough to include all quads plus start and end decorations (rounded left and right caps on markup rectangle).

    PDF Clown calculates this rectangle as follows in TextMarkup:

      public void setMarkupBoxes(
        List<Quad> value
        )
      {
        PdfArray quadPointsObject = new PdfArray();
        double pageHeight = getPage().getBox().getHeight();
        Rectangle2D box = null;
        for(Quad markupBox : value)
        {
          /*
            NOTE: Despite the spec prescription, Point 3 and Point 4 MUST be inverted.
          */
          Point2D[] markupBoxPoints = markupBox.getPoints();
          quadPointsObject.add(PdfReal.get(markupBoxPoints[0].getX())); // x1.
          quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[0].getY())); // y1.
          quadPointsObject.add(PdfReal.get(markupBoxPoints[1].getX())); // x2.
          quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[1].getY())); // y2.
          quadPointsObject.add(PdfReal.get(markupBoxPoints[3].getX())); // x4.
          quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[3].getY())); // y4.
          quadPointsObject.add(PdfReal.get(markupBoxPoints[2].getX())); // x3.
          quadPointsObject.add(PdfReal.get(pageHeight - markupBoxPoints[2].getY())); // y3.
          if(box == null)
          {box = markupBox.getBounds2D();}
          else
          {box.add(markupBox.getBounds2D());}
        }
        getBaseDataObject().put(PdfName.QuadPoints, quadPointsObject);
    
        /*
          NOTE: Box width is expanded to make room for end decorations (e.g. rounded highlight caps).
        */
        double markupBoxMargin = getMarkupBoxMargin(box.getHeight());
        box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
        setBox(box);
    
        refreshAppearance();
      }
    
      private static double getMarkupBoxMargin(
        double boxHeight
        )
      {return boxHeight * .25;}
    

    So it takes the bounding box of all the quads and adds left and right margins each as wide as a quarter of the height of this whole bounding box.

    What is the result in your case?

    While this added margin width is sensible if there is only a single quad, in case of your markup annotation which includes many quads on top of one another, this results in a giant, unnecessary margin.
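
    To put numbers on this, here is a small stand-alone sketch that applies the same quarter-of-the-height rule to plain java.awt.geom rectangles instead of PDF Clown quads. The class MarginDemo, its helper methods, and the column geometry (quads 30 wide and 10 high, 20 apart vertically) are made up for illustration; only the margin rule is taken from the code quoted above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class MarginDemo {
    // Same rule as getMarkupBoxMargin in the quoted PDF Clown code.
    static double margin(double boxHeight) {
        return boxHeight * .25;
    }

    // Union of all quad bounds, built the way setMarkupBoxes builds its box.
    static Rectangle2D union(List<Rectangle2D> quads) {
        Rectangle2D box = null;
        for (Rectangle2D quad : quads) {
            if (box == null) {
                box = (Rectangle2D) quad.clone();
            } else {
                box.add(quad);
            }
        }
        return box;
    }

    // Hypothetical table column: `rows` quads, each 30 wide and 10 high,
    // placed 20 units apart vertically.
    static List<Rectangle2D> column(int rows) {
        List<Rectangle2D> quads = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            quads.add(new Rectangle2D.Double(100, 50 + i * 20, 30, 10));
        }
        return quads;
    }

    public static void main(String[] args) {
        // One quad: the union is 10 high, so the margin is 2.5 per side.
        System.out.println(margin(union(column(1)).getHeight()));
        // Ten quads: the union is 190 high, so the margin is 47.5 per side.
        System.out.println(margin(union(column(10)).getHeight()));
    }
}
```

    With these made-up numbers, a column of ten quads gets a margin wider than the 30-unit quads themselves, which is enough to reach into a neighbouring column.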

    How to improve the code?

    As the added caps depend on the individual quads and not on their combined bounding box, one can improve the code by using the maximum height of the individual quads instead of the height of the bounding box of all quads, e.g. like this:

    Rectangle2D box = null;
    double maxQuadHeight = 0;
    for(Quad markupBox : value)
    {
      double quadHeight = markupBox.getBounds2D().getHeight();
      if (quadHeight > maxQuadHeight)
        maxQuadHeight = quadHeight;
      ...
    }
    ...
    double markupBoxMargin = getMarkupBoxMargin(maxQuadHeight);
    box.setRect(box.getX() - markupBoxMargin, box.getY(), box.getWidth() + markupBoxMargin * 2, box.getHeight());
    setBox(box);
    

    If you don't want to patch PDF Clown for this, you can also execute this code (with minor adaptations) after constructing the TextMarkup _textMarkup to correct the precalculated annotation rectangle.
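
    A minimal sketch of such a correction, kept independent of PDF Clown so that it can be run on its own: it takes the bounds of the individual quads as plain Rectangle2D objects and returns the tightened rectangle. The class TightMarkupBox and the method tightBox are hypothetical names, not part of PDF Clown.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

public class TightMarkupBox {
    // Bounding box of all quads, widened on the left and right by a quarter
    // of the tallest single quad (not of the combined bounding box).
    static Rectangle2D tightBox(List<Rectangle2D> quadBounds) {
        Rectangle2D box = null;
        double maxQuadHeight = 0;
        for (Rectangle2D quad : quadBounds) {
            maxQuadHeight = Math.max(maxQuadHeight, quad.getHeight());
            if (box == null) {
                box = (Rectangle2D) quad.clone();
            } else {
                box.add(quad);
            }
        }
        if (box == null) {
            throw new IllegalArgumentException("at least one quad is required");
        }
        double margin = maxQuadHeight * .25;
        box.setRect(box.getX() - margin, box.getY(), box.getWidth() + margin * 2, box.getHeight());
        return box;
    }
}
```

    In your code the input would presumably be the getBounds2D() results of the quads in highlightQuads, and the output would be passed to _textMarkup.setBox(...). That last step is untested and assumes setBox can be called from your code and that the appearance does not need a further refresh afterwards.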

    Is this fixing a PDF Clown error?

    It is not an error as there is no need for the text markup annotation rectangle to be minimal; PDF Clown could also always use the whole crop box for each such annotation.

    I would assume, though, that the author of the code wanted to calculate a somewhat minimal rectangle but only optimized for the single-line case and so, in a way, did not live up to his own expectations...

    Are there other problems in this code?

    Yes. The text a markup annotation marks need not be horizontal; it may be at an angle, and it could even be vertical. In such a case some margin would also be needed at the top and the bottom of the annotation rectangle, not (only) at the left and the right.
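
    One possible way to cover that case, sketched with plain java.awt.geom types: pad the bounding box on all four sides instead of only left and right. The class RotatedMarkupBox and its methods are hypothetical, and the sketch assumes that the four points of each quad are listed in order around the quad with the first edge (point 0 to point 1) running along the text, so that the second edge (point 1 to point 2) gives the text height. Whether PDF Clown quads always satisfy this would need to be checked.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.List;

public class RotatedMarkupBox {
    // Assumption: p[0] -> p[1] runs along the text, so p[1] -> p[2] is the
    // edge across the text, i.e. the text height, whatever the rotation.
    static double quadHeight(Point2D[] p) {
        return p[1].distance(p[2]);
    }

    // Bounding box of all quad corners, padded on every side by a quarter
    // of the tallest single quad.
    static Rectangle2D paddedBox(List<Point2D[]> quads) {
        Rectangle2D box = null;
        double maxQuadHeight = 0;
        for (Point2D[] quad : quads) {
            maxQuadHeight = Math.max(maxQuadHeight, quadHeight(quad));
            for (Point2D point : quad) {
                if (box == null) {
                    box = new Rectangle2D.Double(point.getX(), point.getY(), 0, 0);
                } else {
                    box.add(point);
                }
            }
        }
        if (box == null) {
            throw new IllegalArgumentException("at least one quad is required");
        }
        double margin = maxQuadHeight * .25;
        box.setRect(box.getX() - margin, box.getY() - margin,
                box.getWidth() + margin * 2, box.getHeight() + margin * 2);
        return box;
    }
}
```

    Padding all four sides is slightly larger than strictly necessary for horizontal text, but as noted above the annotation rectangle does not have to be minimal.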