PDFBox PDFTextStripperByArea coordinates shifted


I am having issues with coordinates. The PDFTextStripperByArea region seems to be pushed too high.

Consider the following example snippet:

// Imports below assume the PDFBox 1.8.x package layout.
import java.awt.Color;
import java.awt.geom.Rectangle2D;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.util.PDFTextStripperByArea;

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor(Color.CYAN);
contentStream.fillRect(x, y, width, height);
contentStream.close();

// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...); ...

The cyan rectangle overlays the desired region nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it -- the extraction region looks shifted "upwards" (along the y coordinate). What is going on?


Solution

  • Text is usually contained inside a positioning rectangle. Sometimes the text is not at the expected position inside that rectangle, and PDFBox uses the rectangle to guess where the text is located. So if a text box starts outside the capture area and its text flows into the area, that text might not be extracted.

    Rough sketch: the text box starts outside the capture area, but its text flows into it. That text might not be captured.

    ____________
    |Page      |
    |   _______|
    |   |Area ||
    |   |     ||
    | ..|.....||
    | ⁞ |Text⁞||
    | ⁞ |____⁞||
    | ⁞......⁞ |
    |__________|
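
The effect in the sketch above can be reproduced with plain java.awt geometry. This is a minimal sketch, not PDFBox code: it assumes the region test only looks at a single anchor point of the text box (here its x/y origin), and the class name, method names and all numbers are made up for illustration. How PDFBox actually decides may differ between versions.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: models a region test that
// looks at only one anchor point per piece of text.
class AnchorPointRegionCheck {

    // True if the text's anchor point lies inside the capture region.
    static boolean isCaptured(Rectangle2D region, double anchorX, double anchorY) {
        return region.contains(anchorX, anchorY);
    }

    // True if the text's box overlaps the capture region at all.
    static boolean overlaps(Rectangle2D region, Rectangle2D textBox) {
        return region.intersects(textBox);
    }

    public static void main(String[] args) {
        // Capture region: x, y, width, height (made-up numbers).
        Rectangle2D region = new Rectangle2D.Float(100f, 100f, 200f, 100f);

        // Text box that starts to the left of the region and extends into it.
        Rectangle2D textBox = new Rectangle2D.Float(60f, 150f, 120f, 20f);

        System.out.println("overlaps region: " + overlaps(region, textBox));
        System.out.println("anchor inside:   "
                + isCaptured(region, textBox.getX(), textBox.getY()));
    }
}
```

With these numbers the text box overlaps the region, but its anchor point lies outside it, so an anchor-only check would skip the text even though part of it is visibly inside the rectangle. If something like this is dropping your lines, comparing the positions the stripper reports for the missing text with the region bounds should show it.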