
How to extract text from PDF according to its location?


I have multiple PDFs and I want to extract text from a certain region of their first pages. Given that I have the coordinates of the bounding box for the text in the PDF, how do I extract that text from the command line?

I researched a bit and found that PDFMiner and PDFBox can do this. But PDFMiner is very poorly documented.

Can someone tell me how to do this with PDFMiner, or suggest another solution?

PS: I am working in a Linux terminal.


Solution

  • pdftotext (use a recent, Poppler-based version) lets you define a page region to extract text from.

    Try this:

    pdftotext    \
      -f 5       \
      -l 7       \
      -x 200     \
      -y 700     \
      -W 144     \
      -H 80      \
       input.pdf \
       output.txt
    

    It selects pages 5 to 7 and, on each of them, a rectangle 144 points wide (72 points == 1 inch) and 80 points high whose top-left corner is at x-coordinate 200 and y-coordinate 700. To process only the first page, as in the question, use -f 1 -l 1 instead.
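
For the case in the question (the same region on the first page of several PDFs), the command above can be wrapped in a small shell function. This is only a sketch under a few assumptions: the function names `bbox_to_pdftotext_args` and `extract_first_page_region` are made up for illustration, the coordinates are whole numbers of points at pdftotext's default resolution, and the crop origin is the top-left corner of the page. If your bounding box comes from a tool that reports PDF-native coordinates (origin at the bottom-left, y growing upward), the first helper flips the y axis; it needs the page height, which is 792 points for US Letter.

```shell
#!/bin/sh
# Hypothetical helper: convert a bounding box given in PDF-native coordinates
# (origin at the bottom-left, y increasing upward) into pdftotext crop flags
# (assumed origin at the top-left, y increasing downward).
# Arguments: x0 y0 x1 y1 page_height, all integers in points.
bbox_to_pdftotext_args() {
    x0=$1; y0=$2; x1=$3; y1=$4; page_height=$5
    printf '%s\n' "-x $x0 -y $((page_height - y1)) -W $((x1 - x0)) -H $((y1 - y0))"
}

# Hypothetical helper: print the cropped text from page 1 of every PDF given
# as an argument. The box (200, 700, 144 x 80) is the one used in the answer.
# Passing "-" as the output file makes pdftotext write to standard output.
extract_first_page_region() {
    for pdf in "$@"; do
        printf '== %s ==\n' "$pdf"
        pdftotext -f 1 -l 1 -x 200 -y 700 -W 144 -H 80 "$pdf" -
    done
}
```

After sourcing the file, `extract_first_page_region *.pdf` prints the region from each PDF in the current directory, and `bbox_to_pdftotext_args 200 12 344 92 792` shows the flags that correspond to a bottom-left-origin box on a 792-point-high page.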