Apache PDFBox: Confusion about coordinates

I am trying to extract some text from a PDF. For that I need to define a rectangle that contains the text.

I noticed that coordinates seem to mean something different for text extraction than they do for drawing.

package MyTest.MyTest;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
import org.apache.pdfbox.text.*;
import java.awt.*;
import java.io.*;

public class MyTest 
{   
  public static void main (String [] args) throws Exception
  { 
    PDDocument pd = PDDocument.load (new File ("my.pdf"));  
    PDFTextStripperByArea st = new PDFTextStripperByArea ();
    PDPage pg = pd.getPage (0);

    float h = pg.getMediaBox ().getHeight ();
    float w = pg.getMediaBox ().getWidth ();
    System.out.println (h + " x " + w + " in internal units");
    h = h / 72 * 2.54f * 10;
    w = w / 72 * 2.54f * 10;
    System.out.println (h + " x " + w + " in mm");



    int X = 85;
    int Y = 175;
    int dX = 250;
    int dY = 15;

    // extract some text
    st.addRegion ("a", new Rectangle (X, Y, dX, dY));
    st.extractRegions (pg);
    String text = st.getTextForRegion ("a");
    System.out.println("text="+text);


    // fill a rectangle
    PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
    contents.setNonStrokingColor (Color.RED);  
    contents.addRect (X, Y, dX, dY);
    contents.fill ();
    contents.close ();
    pd.save ("x.pdf");
  }
}

The text I extract (output of text= in the console) is not the text I overdraw with my red rectangle (generated x.pdf).

Why?

To test this, use any PDF you already have. To avoid a lot of trial and error in aiming the rectangle at some text, use a file with a lot of text.

Solution

There are (at least) two issues in your approach:

Different coordinate systems

You use st.addRegion. Its JavaDoc comment tells us:

/**
 * Add a new region to group text by.
 *
 * @param regionName The name of the region.
 * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
 * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
 */
public void addRegion( String regionName, Rectangle2D rect )

(Actually the whole text extraction apparatus of PDFBox uses its own coordinate system, and there have already been many questions on Stack Overflow because of the confusion this has caused.)

On the other hand, contents.addRect does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the maximum crop box y coordinate to get a coordinate for addRect.

Furthermore, the region rectangles have their anchor point at the top left while the regular PDF rectangles (like the one you define with contents.addRect) have it at the bottom left. Thus, you additionally have to subtract the rectangle height from the resulting y coordinate.

Actually, you may have to change the x coordinate, too. It is not mirrored, but there may be a shift: the PDFBox text extraction coordinate system uses x=0 for the left page border, but that is not necessarily the case in PDF user space. Thus, you may have to add the left border x coordinate of the crop box to your text extraction x coordinate.
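The conversion described above can be sketched as a small helper. This is only an illustration under stated assumptions, not part of PDFBox: the class name RegionToPdfRect and the method toPdfRect are hypothetical, the crop box values are passed in as plain numbers (in real code they would presumably come from pg.getCropBox ().getLowerLeftX () and pg.getCropBox ().getUpperRightY ()), and the page is assumed to be unrotated.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not a PDFBox class.
class RegionToPdfRect {
    /**
     * Converts a text extraction region (origin at the top left of the crop box,
     * y growing downwards, anchored at its top left corner) into a rectangle in
     * PDF user space (y growing upwards, anchored at its lower left corner).
     * Assumes an unrotated page.
     */
    static Rectangle2D.Double toPdfRect(Rectangle2D region,
                                        double cropLowerLeftX,
                                        double cropUpperRightY) {
        // x is only shifted by the left border of the crop box, not mirrored
        double x = cropLowerLeftX + region.getX();
        // y is mirrored at the top of the crop box, then moved down by the
        // rectangle height to reach the lower left corner
        double y = cropUpperRightY - region.getY() - region.getHeight();
        return new Rectangle2D.Double(x, y, region.getWidth(), region.getHeight());
    }
}
```

For the values from the question (X = 85, Y = 175, dX = 250, dY = 15) and a crop box whose left border is at x = 0 and whose top is at y = 842, this yields a rectangle with its lower left corner at (85, 652) and unchanged width and height. Those would be the values to pass to contents.addRect.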

Possibly changed coordinate system

In the page content stream, the coordinate system may have been changed by applying a transformation to the current transformation matrix. As a result, the coordinates in the instructions you append to it may mean something different even from what is outlined above.
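To illustrate the effect, here is a minimal sketch that uses java.awt.geom.AffineTransform as a stand-in for the current transformation matrix. The class CtmDemo and its method are hypothetical, and the scaling by 0.5 is just an assumed example of what existing page content might have left in place. The sketch does not read the matrix from a real content stream.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo class, not a PDFBox class.
class CtmDemo {
    /**
     * Maps a point given in the current (possibly transformed) coordinate system
     * to the default user space, using ctm as the current transformation matrix.
     */
    static Point2D toDefaultUserSpace(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }
}
```

If the existing content had scaled the coordinate system by 0.5 in both directions without restoring the graphics state, a corner appended at (85, 175) would end up at (42.5, 87.5) on the page, i.e. toDefaultUserSpace (AffineTransform.getScaleInstance (0.5, 0.5), 85, 175).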

To rule out such an effect, you should use a different PDPageContentStream constructor with an additional boolean resetContext parameter:

/**
 * Create a new PDPage content stream.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @param resetContext Tell if the graphic context should be reset. This is only relevant when
 * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
 * appending to an existing stream, because the existing stream may have changed graphic
 * properties (e.g. scaling, rotation).
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress, boolean resetContext) throws IOException

I.e. replace

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);

with

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false, true);