Tags: java, pdf, pdf-scraping

Extract data from PDF document


I have a PDF document.

It contains data in tabular format. I want to extract the data into a comma-delimited text file (CSV), with commas as the column delimiters.

Any suggestions?


Solution

  • Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction the syntax makes is between vector elements (lines, fills, ...), images, and text.

    Whether a given character is part of a table, part of a line of text, or just a lone character in an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
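
To make the difficulty concrete, here is a minimal sketch of the kind of heuristic a table extractor has to apply. It does not use any PDF library: `Fragment`, `toCsv`, and `rowTolerance` are hypothetical names invented for this illustration, and the sketch assumes that some parser has already produced whole-cell strings with page coordinates, with y growing downward from the top of the page. Fragments whose y values lie close together are treated as one row, each row is ordered by x, and the result is written as comma-delimited text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionedTextToCsv {

    /** Hypothetical stand-in for what a PDF parser reports: a string and where it is drawn. */
    public static final class Fragment {
        final double x;    // distance from the left page edge
        final double y;    // distance from the top page edge (assumed to grow downward)
        final String text;

        public Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups fragments into rows by similar y, orders each row by x, and joins cells with commas. */
    public static String toCsv(List<Fragment> fragments, double rowTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double lastY = Double.NaN;
        for (Fragment f : byY) {
            // Start a new row when the vertical gap to the previous fragment exceeds the tolerance.
            if (rows.isEmpty() || f.y - lastY > rowTolerance) {
                rows.add(new ArrayList<>());
            }
            rows.get(rows.size() - 1).add(f);
            lastY = f.y;
        }

        StringBuilder csv = new StringBuilder();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    csv.append(',');
                }
                csv.append(escape(row.get(i).text));
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        // Deliberately out of order: a PDF need not draw text in reading order.
        fragments.add(new Fragment(200, 120.4, "42"));
        fragments.add(new Fragment(50, 100, "Name"));
        fragments.add(new Fragment(50, 120, "Smith, J."));
        fragments.add(new Fragment(200, 100.3, "Amount"));
        System.out.print(toCsv(fragments, 2.0));
    }
}
```

Even this toy version depends on a hand-tuned tolerance, and it breaks as soon as a cell is empty, a cell wraps onto several lines, or the parser reports single glyphs instead of whole strings. Those are the cases where a dedicated tool earns its keep.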

    For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

    Why Updating Dollars for Docs Was So Difficult (ProPublica website)

    Having said the above, let me now add this:

    Tabula is written in Ruby.


    Update

    Here is an asciinema screencast (which you can also download and replay locally in your Linux/Mac OS X/Unix terminal with the help of the asciinema command-line tool), starring tabula-extractor:

    asciicast