I need to extract part of a PDF page using user-specified coordinates. So far, I have used PDFBox to set a crop box at the desired location:
document_ = new PDDocument();
document_.addPage(page_);
page_.setCropBox(new PDRectangle(startX,startY,width,pageHeight));
When I save the document, this gives the expected PDF, clipped to that region. But when I extract the text from the document with PDFTextStripper, it also returns the text outside the crop box.
I also tried PDFTextStripperByArea, but the text it returns is wrong. I am using the code below:
super.addRegion("test", document.getPage(0).getCropBox().toGeneralPath().getBounds2D());
super.extractRegions(document.getPage(0));
super.getTextForRegion("test");
What is the mistake here? How do I extract only the text inside the crop box?
I resolved this by manually checking whether each text and image item lies inside the crop box, comparing its coordinates against the crop box bounds:
if ((textItem.getStartXPos() + this.cropbox.getLowerLeftX()) >= this.cropbox.getLowerLeftX()
        && (textItem.getEndXPos() + this.cropbox.getLowerLeftX()) <= (this.cropbox.getLowerLeftX() + this.cropbox.getWidth())
        && this.cropbox.getLowerLeftY() <= (this.cropbox.getLowerLeftY() + textItem.getStartYPos())
        && (this.cropbox.getLowerLeftY() + textItem.getStartYPos()) <= this.cropbox.getUpperRightY()) {
    this.pageData.addTextItem(textItem);
}
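In the check above, the crop box offset is added to both sides of each comparison, so it cancels out. What is left is a test of the item's coordinates against the crop box width and height. Here is a minimal sketch of that simplified test, assuming the item coordinates are already relative to the crop box origin (which is what the original comparison implies). The class CropBoxFilter and its method are hypothetical names, not part of PDFBox, and the result may differ from the original by floating-point rounding at the edges.

```java
/**
 * Hypothetical helper (not part of PDFBox): tests whether an item lies
 * inside a crop box, given coordinates relative to the crop box origin.
 */
final class CropBoxFilter {
    private final float width;   // crop box width, e.g. cropbox.getWidth()
    private final float height;  // crop box height, e.g. cropbox.getHeight()

    CropBoxFilter(float width, float height) {
        this.width = width;
        this.height = height;
    }

    /**
     * Returns true when the horizontal extent [startX, endX] fits within
     * [0, width] and the y position lies within [0, height].
     */
    boolean contains(float startX, float endX, float y) {
        return startX >= 0f
                && endX <= width
                && y >= 0f
                && y <= height;
    }
}
```

It would be used along the lines of `if (filter.contains(textItem.getStartXPos(), textItem.getEndXPos(), textItem.getStartYPos())) this.pageData.addTextItem(textItem);`, with the filter built once per page from the crop box dimensions.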