I need to extract data from similarly formatted tables from this file. There are some OCR errors but I have an automated method to correct them.
I have tried:
The Problem: The commercial tools are very bad at detecting the edges of the table. The tables follow a similar general format, but each scan is aligned slightly differently, so hard-coding the borders won't work either.
Question: Do you guys know a good way to detect where the table begins and then apply one of a few templates?
Any other tips for this kind of work are greatly appreciated.
UPDATE 2/26: I solved my own question, though feel free to respond with faster or better solutions.
One of the main problems is that the tables are only roughly similar: their dimensions vary from page to page. The scanned images are also slightly offset from page to page, which gives two alignment problems. My current workflow solves both and is as follows.
Solution:
The images of the same table type are still not aligned, so specifying a table layout in (x,y) coordinates won't work. The table locations are different in each image.
I needed to align the images based on the table location, but without already detecting the table there was no good way to do that.
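For anyone hitting the same chicken-and-egg problem, here is a minimal sketch of one common way to get a rough table location without a template: count dark pixels per row and per column and treat the long dark runs as the table's ruling lines. This is not what I ended up doing, just an illustration. It assumes the scan has already been binarized into a list of rows of 0 (white) and 1 (dark), that the table has ruled borders, and that the page is not noticeably rotated. The names `table_bbox`, `draw_box` and `min_dark_fraction` are made up for this example.

```python
def table_bbox(image, min_dark_fraction=0.5):
    """Return (top, left, bottom, right) of the outermost ruling lines, or None.

    image: list of rows, each a list of 0 (white) / 1 (dark) pixels.
    A row (or column) counts as a ruling line when at least
    min_dark_fraction of its pixels are dark. The threshold is an
    assumption and would need tuning on real scans.
    """
    height, width = len(image), len(image[0])
    row_counts = [sum(row) for row in image]
    col_counts = [sum(image[y][x] for y in range(height)) for x in range(width)]

    line_rows = [y for y, n in enumerate(row_counts) if n >= min_dark_fraction * width]
    line_cols = [x for x, n in enumerate(col_counts) if n >= min_dark_fraction * height]
    if not line_rows or not line_cols:
        return None  # no ruling lines found
    return (line_rows[0], line_cols[0], line_rows[-1], line_cols[-1])


def draw_box(height, width, top, left, bottom, right):
    """Hypothetical helper: a blank page with one rectangular table border."""
    page = [[0] * width for _ in range(height)]
    for x in range(left, right + 1):
        page[top][x] = 1
        page[bottom][x] = 1
    for y in range(top, bottom + 1):
        page[y][left] = 1
        page[y][right] = 1
    return page


page = draw_box(8, 10, top=2, left=3, bottom=6, right=8)
print(table_bbox(page))
```

Once every page has a bounding box, the difference between two pages' top-left corners gives the offset needed to line them up.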
I solved the problem in an interesting way, but I tried the following steps first.
Solution:
After cutting the images into tables as explained in the Table Type Alignment section, use the Auto-Align Layers feature in Photoshop to align the images.
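If you would rather script this step than use Photoshop, a rough stand-in is to search for the small shift that makes each cut-out match a reference cut-out best. The sketch below is not Photoshop's algorithm and not my workflow. It assumes the misalignment is a pure translation of a few pixels, that both images are the same size, and that they are binarized as lists of rows of 0/1. `estimate_shift`, `draw_box` and `max_shift` are hypothetical names for this example.

```python
def estimate_shift(reference, page, max_shift=5):
    """Return (dy, dx) such that page[y + dy][x + dx] best matches reference[y][x].

    Brute-force search over all shifts in [-max_shift, max_shift], scoring
    each by the fraction of mismatched pixels in the overlapping region.
    """
    height, width = len(reference), len(reference[0])
    best = None  # (score, dy, dx)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            mismatches = 0
            compared = 0
            for y in range(height):
                py = y + dy
                if not 0 <= py < height:
                    continue
                for x in range(width):
                    px = x + dx
                    if 0 <= px < width:
                        compared += 1
                        mismatches += reference[y][x] != page[py][px]
            if compared == 0:
                continue
            score = mismatches / compared
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


def draw_box(height, width, top, left, bottom, right):
    """Hypothetical helper: a blank page with one rectangular table border."""
    page = [[0] * width for _ in range(height)]
    for x in range(left, right + 1):
        page[top][x] = 1
        page[bottom][x] = 1
    for y in range(top, bottom + 1):
        page[y][left] = 1
        page[y][right] = 1
    return page


reference = draw_box(12, 12, top=2, left=2, bottom=7, right=8)
shifted = draw_box(12, 12, top=4, left=3, bottom=9, right=9)  # same box, moved
dy, dx = estimate_shift(reference, shifted, max_shift=3)
print(dy, dx)
```

The brute-force search is slow on full-resolution scans, so in practice you would downsample first or use an image library's registration routine. It also does nothing about rotation or scaling.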
Step-by-Step Solution:
Done! Combine the files for each table however you like. I will post my Python code for doing this when I'm done with the project. Once the data is cleaned, I will post it too.