
How do I extract tables from a historical PDF?


I need to extract data from similarly formatted tables in this file. There are some OCR errors, but I have an automated method to correct them.

I have tried:

  • ABBYY Finereader table detection.
  • Tabula table extraction
  • Camelot table extraction
  • Custom python code

The Problem: The commercial tools are very bad at detecting the edges of the table. The tables follow a similar general format, but each scan is aligned slightly differently, so hard-coding the borders won't work either.

Question: Does anyone know a good way to detect where the table begins and then apply one of a few templates?

Any other tips for this kind of work are greatly appreciated.


Solution

  • UPDATE 2/26: I solved my own question, though feel free to respond with faster or better solutions.

    One of the main problems is that the tables are roughly similar in their dimensions but they vary from page to page. The scanned images are also slightly offset from page to page, giving two alignment problems. My current workflow solves both and is as follows.

    Table Type Alignment

    Solution:

    1. Use the image editing tools in ABBYY to cut each page horizontally. This gives one table on each page.
    2. Note that there are 4 table types. Even pages and odd pages have separate layouts. The first table on each page includes a field for date.
    3. That gives first-table-even, first-table-odd, reg-table-even, and reg-table-odd. Processing one type at a time with fixed table areas and columns fixes the misalignment caused by differences in the tables' layouts.
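Since Camelot is already in the toolchain, the fixed areas and columns from step 3 can be stored as one template per table type. The coordinates below are placeholders (the real values would come from measuring one aligned scan of each type), but `table_areas` and `columns` are Camelot's actual stream-flavor parameters:

```python
# Hypothetical per-type templates; the coordinates are made up for
# illustration, not the real values from this project.
TEMPLATES = {
    "first-table-even": {"table_areas": ["30,770,565,60"], "columns": ["110,190,270,350,430"]},
    "first-table-odd":  {"table_areas": ["45,770,580,60"], "columns": ["125,205,285,365,445"]},
    "reg-table-even":   {"table_areas": ["30,800,565,60"], "columns": ["110,190,270,350,430"]},
    "reg-table-odd":    {"table_areas": ["45,800,580,60"], "columns": ["125,205,285,365,445"]},
}

def extract(pdf_path, table_type):
    """Run Camelot with the fixed template for one table type."""
    import camelot  # pip install camelot-py[cv]
    t = TEMPLATES[table_type]
    # explicit `columns` requires the 'stream' flavor
    return camelot.read_pdf(pdf_path, flavor="stream",
                            table_areas=t["table_areas"],
                            columns=t["columns"])
```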

    Image Alignment

    The images of the same table type are still not aligned, so specifying a table layout in (x, y) coordinates won't work. The tables' locations differ in each image.

    I needed to align the images based on the table location, but without already detecting the table there was no good way to do that.

    I solved the problem in an interesting way, but I tried the following steps first.

    1. Detect vertical lines using OpenCV. Result: it did not detect faint lines well and would often miss lines entirely, making it useless for alignment.
    2. Use Scan Tailor to detect content. Result: The detection algorithm would crop some tables too much in some files and in others include white space because of specks in the image. Despeckling didn't help.
    3. Use Camelot with wide table areas and no column values. Result: this would probably work well in other cases, but Camelot fell down here. The data is reported down to cents, and there are spaces between every three digits. This resulted in the 00 being misplaced in several columns.

    Solution:

    After cutting the images into tables (as explained in the Table Type Alignment section), use the Auto-Align Layers feature in Photoshop to align the images.

    Step-by-Step Solution:

    1. Open Photoshop.
    2. Load the images of one table type into a single file using File > Scripts > Load Files into Stack.
    3. Use Edit > Auto-Align Layers.
    4. Use the crop tool to make each file the same size.
    5. Export each layer as its own file via File > Export > Layers to Files.
    6. Use the ABBYY OCR editor on each of the 4 table types, hard-coding the columns and rows with the GUI editor.
    7. Export to CSV from ABBYY.
    8. Use something like clean.py to remove spaces and bad characters.
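clean.py itself isn't shown in the post, so this is only a guess at its shape: collapse the spaces the OCR leaves between digit groups, then strip stray glyphs against a conservative whitelist:

```python
import re

def clean_cell(cell: str) -> str:
    """Hypothetical cleaning pass for one CSV cell."""
    # Collapse the spaces OCR inserts between digit groups,
    # e.g. "1 234 56" -> "123456"
    cell = re.sub(r"(?<=\d) (?=\d)", "", cell)
    # Drop anything outside a conservative whitelist of characters
    cell = re.sub(r"[^0-9A-Za-z.,/ -]", "", cell)
    return cell.strip()

row = ["1 234 56", "12/31", "Tot~al", " 7 890 00 "]
print([clean_cell(c) for c in row])
# -> ['123456', '12/31', 'Total', '789000']
```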

    Done! Combine the files for each table however you like. I will post my Python code for doing this when I'm done with the project. Once cleaned, I will post the data too.