python parsing pdf pdfbox apache-tika

parse tables from a PDF document


The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this:

[Image: example table from the PDF]

I'd like to programmatically extract the data and the structure from these tables.

Things I've tried so far, all aimed at converting the PDF to HTML:

  1. Tika: Unfortunately, the tables are converted to space-delimited paragraphs, and some of the strings contain spaces, so it's not possible to split them reliably into cells.
  2. Python's PDFMiner: It raised an assertion error due to missing fonts. I suspect the HTML would have been similar to the output from Tika, though I'll need to resolve the missing-fonts issue to confirm this.
  3. Online tools: I tried http://www.zamzar.com/ and a couple of others. Either the file was too big for the service to process, or the conversion generated errors.

I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.

The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.


Solution

  • You could try PDFBox. The documentation for that is here:

    https://pdfbox.apache.org/1.8/cookbook/textextraction.html

    Extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use their coordinates to determine the column and row boundaries. You can then set up text regions to determine which characters are drawn in which region. Since you know the layout of the regions is tabular, you can define the tables and work out which row and column each piece of extracted text belongs to with simple algorithms.
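
    The last step (turning line segments and text positions into rows and columns) can be sketched independently of PDFBox. The following is only a minimal illustration in Python, not PDFBox code. It assumes you have already collected the ruling lines as (x0, y0, x1, y1) tuples and the text fragments as (x, y, text) tuples, with y increasing down the page. The names cluster and build_table, the tolerance value, and the sample coordinates are all made up for the example.

```python
import json
from bisect import bisect_right


def cluster(values, tol=1.0):
    """Sort coordinates and drop those within `tol` of the previous kept one."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def build_table(segments, words, tol=1.0):
    """Assign text fragments to cells bounded by ruling lines.

    segments: iterable of (x0, y0, x1, y1) line segments (hypothetical input,
              e.g. collected from intercepted stroke operations).
    words:    iterable of (x, y, text) fragments.
    Assumes y grows downward; flip the y values first if your source
    uses a bottom-left origin.
    """
    # Vertical segments give column boundaries, horizontal ones row boundaries.
    xs = cluster([x0 for x0, y0, x1, y1 in segments if abs(x0 - x1) <= tol], tol)
    ys = cluster([y0 for x0, y0, x1, y1 in segments if abs(y0 - y1) <= tol], tol)
    rows, cols = len(ys) - 1, len(xs) - 1
    grid = [[[] for _ in range(cols)] for _ in range(rows)]

    # Visit fragments in reading order so text within a cell stays ordered.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        c = bisect_right(xs, x) - 1
        r = bisect_right(ys, y) - 1
        if 0 <= r < rows and 0 <= c < cols:  # ignore text outside the grid
            grid[r][c].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


# Made-up coordinates for a 2x2 table.
segments = [
    (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40),  # horizontal rules
    (0, 0, 0, 40), (50, 0, 50, 40), (100, 0, 100, 40),   # vertical rules
]
words = [
    (5, 10, "Processor"), (55, 10, "Intel"), (60, 10, "Core"),
    (5, 30, "Memory"), (55, 30, "8 GB"),
]
table = build_table(segments, words)
print(json.dumps(table))
```

    Because cells are decided by position rather than by whitespace, strings that contain spaces stay in one cell. Merged cells and tables without ruling lines are not handled by this sketch.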