java pdf

How to extract data from a specific rectangular area in a PDF using Java?


I am trying to extract data from a specific rectangular region of a PDF, defined by two corner coordinates. Is it possible to do this directly on the PDF, or would I have to convert it into an image and use OCR? If so, does PDFBox or iText include a way to analyze images via OCR? Thanks!



Solution

  • If the area contains text, use PDFBox:

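    // Needs: java.awt.Rectangle, java.io.File, org.apache.pdfbox.pdmodel.PDDocument,
    //        org.apache.pdfbox.text.PDFTextStripperByArea (PDFBox 2.x API)
    try (PDDocument document = PDDocument.load(new File("target.pdf"))) {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        // x, y, width, height in page units; y is measured down from the top of the page
        Rectangle rect = new Rectangle(35, 375, 340, 204);
        stripper.addRegion("class1", rect);
        // getPage is zero-based, so index 1 is the second page of the document
        stripper.extractRegions(document.getPage(1));
        System.out.println(stripper.getTextForRegion("class1"));
    }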
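
The question describes the region by two coordinates, while the snippet above takes x, y, width and height. A minimal sketch of that conversion follows, using only `java.awt.geom`. The class and method names (`RegionFromCorners`, `fromCorners`, `fromBottomLeftCorners`) are made up for illustration and are not part of PDFBox. The second helper assumes the coordinates were measured from the bottom-left corner of the page and that the page height is known.

```java
import java.awt.geom.Rectangle2D;

public class RegionFromCorners {

    // Hypothetical helper: builds a rectangle from any two opposite corners,
    // regardless of the order in which the corners are given.
    static Rectangle2D.Double fromCorners(double x1, double y1, double x2, double y2) {
        double x = Math.min(x1, x2);
        double y = Math.min(y1, y2);
        return new Rectangle2D.Double(x, y, Math.abs(x2 - x1), Math.abs(y2 - y1));
    }

    // Hypothetical helper: same as above, but for corners measured from the
    // bottom-left of the page. The y values are flipped using the page height
    // so the result is expressed from the top of the page.
    static Rectangle2D.Double fromBottomLeftCorners(double x1, double y1,
                                                    double x2, double y2,
                                                    double pageHeight) {
        return fromCorners(x1, pageHeight - y1, x2, pageHeight - y2);
    }

    public static void main(String[] args) {
        // Corners (35, 375) and (375, 579) describe the same area as
        // new Rectangle(35, 375, 340, 204) in the answer above.
        Rectangle2D.Double region = fromCorners(35, 375, 375, 579);
        System.out.println(region);
    }
}
```

The resulting `Rectangle2D` can be passed to `stripper.addRegion("class1", region)` in place of the `Rectangle` in the answer, since `addRegion` accepts a `Rectangle2D`.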