python machine-learning deep-learning neural-network ocr

OCR PDF image to Excel by template


I need to convert a large number of poor-quality PDF scans of tables into Excel tables. The only solution I can see is to train Tesseract or some other framework on pre-generated images (in most cases all the tables in the PDFs have the same layout). Is it realistic to get a good result, around 70-80%, with a home setup, and what would you advise? I would appreciate any advice other than ABBYY FineReader or similar solutions (I tested it on my dataset: the results were poor and there are few opportunities for automation).

The table structure must come out correct in every result, because the output will be finished by hand afterwards.


Solution

  • You should use a PDF parser for that.

    Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.

    [image: parsed result]
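
Since the question says the tables share one layout in most cases, another route that stays with the template idea is to cut each page into cells at fixed positions and OCR every cell separately, which keeps the table structure intact even when some cell text is misread. The following is only a minimal sketch under assumptions: the names `cell_boxes`, `table_from_template` and `write_csv` are hypothetical, the grid line positions (`col_edges`, `row_edges`) are assumed to have been measured once on a reference scan, and the OCR step is passed in as a function so that any engine can be plugged in. CSV is used as the output because Excel opens it directly.

```python
import csv
from typing import Any, Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels, the same order PIL's Image.crop expects
Box = Tuple[int, int, int, int]


def cell_boxes(col_edges: Sequence[int], row_edges: Sequence[int]) -> List[List[Box]]:
    """Build one bounding box per table cell from the template's grid lines.

    col_edges / row_edges are the pixel positions of the vertical and
    horizontal grid lines (assumed to be measured once on a reference scan).
    """
    return [
        [
            (col_edges[c], row_edges[r], col_edges[c + 1], row_edges[r + 1])
            for c in range(len(col_edges) - 1)
        ]
        for r in range(len(row_edges) - 1)
    ]


def table_from_template(
    page: Any,
    col_edges: Sequence[int],
    row_edges: Sequence[int],
    ocr_cell: Callable[[Any, Box], str],
) -> List[List[str]]:
    """OCR every cell separately so rows and columns are preserved.

    ocr_cell is supplied by the caller. With Pillow and pytesseract
    installed it could be, for example:
        lambda img, box: pytesseract.image_to_string(img.crop(box), config="--psm 7")
    (an assumption for illustration, not something the answer above uses).
    """
    return [
        [ocr_cell(page, box).strip() for box in row]
        for row in cell_boxes(col_edges, row_edges)
    ]


def write_csv(rows: List[List[str]], path: str) -> None:
    """Write the recognised table to a CSV file that Excel can open."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

A page would then be processed with something like `write_csv(table_from_template(page_image, col_edges, row_edges, my_ocr), "page1.csv")`, where `page_image` and `my_ocr` are whatever image object and OCR wrapper you use. Scans that are shifted or rotated would need to be aligned to the reference page first, which this sketch does not attempt.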