
How to convert image to table


I have an image of a table (in my case a .gif) and want to extract the table it contains (ideally as .ods).

Is there any way to do so? (Doing it manually is ruled out, since the table has more than 1000 rows and 6 columns.)

Here is a part of the image / table: [image]


Solution

  • OCR will get you most of the way there, but you'll need to verify the data manually and fix the inaccuracies it leaves behind. It definitely won't be perfect.

    First thing to do is to ensure you have a good quality image for the OCR software:

    Here's what I did with your sample png (I'm using Windows):

    1. I opened the image in GIMP.
    2. Removed the orange/blue backgrounds:

      a) Select -> By Color and clicked the blue background

      b) I held down Shift and clicked the orange background (this will add it to the current selection)

      c) Edit -> Fill With BG Color (this sets it to white)

      d) Ctrl-Shift-A to cancel the selection

    3. I removed the partially cut off '305' line:

      a) I selected that line with the Rectangle Select tool from the toolbox and filled the selection with BG Color, as above

    4. Let's remove the table border:

      a) Click the 'Fuzzy Select' tool button from the palette

      b) Click somewhere on the table border (you should see the 'marching ants' instead of the border)

      c) Edit -> Fill With BG Color

      d) Ctrl-Shift-A to cancel the selection again

    5. We need to increase the number of pixels that the numbers use so that the OCR can better detect their shapes

      a) Image -> Scale Image. I chose to scale by 1000% with Linear Interpolation (the other interpolations won't work as well)

    6. Download and install Tesseract from GitHub

      a) At the command prompt, type the following (include the double quotes to cope with spaces in the path, and change the paths as necessary):

         "D:\Program Files (x86)\Tesseract-OCR\tesseract" "d:\temp\your_image.png" "d:\temp\your_txt_file_output"

    7. The output will be a text file with a .txt extension appended to the name you gave. It will still have a few artifacts, but we can easily correct those in Notepad++ (or similar):

      a) The commas were seen as full-stops, so I did a Find and Replace of "." with "," (I'm assuming you don't have any decimal points in the data!)

      b) There were some spaces before a few commas, so I did Find and Replace " ," with "," (note I included a space before the comma in the Find)

      c) There were still some spaces in the numbers, so I did a Find and Replace of " " with "" (a space with an empty replace)

    This gave the following result:

    298
    299
    300
    301
    302
    303
    304

    910,820,000
    920,820,000
    930,820,000
    941,820,000
    952,820,000
    983,820,000
    9?4,820,000

    210,000
    220,000
    220,000
    220,000
    220,000
    220,000
    220,000

    2,500
    2,500
    3,000
    3,000
    3,000
    3,000
    3,000

    19,000
    19,000
    20,000
    20,000
    20,000
    20,000
    20,000

    Note the question mark in the place of 7 in the second block of text. Things like that still need to be tidied up.
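If you would rather not do the find-and-replace passes from step 7 by hand, they can be scripted. The following is only a rough sketch in Python, not something I ran on your file: the function names `clean_ocr_line` and `suspicious_lines` and the sample strings are made up for illustration, and it assumes (as above) that the data contains no real decimal points. It also flags any line that still contains something other than digits and commas, such as the question mark mentioned above, so you know where to look.

```python
import re

def clean_ocr_line(line: str) -> str:
    """Apply the step 7 replacements to a single line of OCR output."""
    line = line.replace(".", ",")  # full stops were really commas
    line = line.replace(" ", "")   # drop stray spaces (this also covers " ,")
    return line

def suspicious_lines(lines):
    """Return (line_number, text) for lines with anything besides digits and commas."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if text and not re.fullmatch(r"[0-9,]+", text)
    ]

# Hypothetical raw OCR lines, invented to show the kinds of artifacts described
sample = ["941 .820.000", "9?4,820 ,000", "2.500"]
cleaned = [clean_ocr_line(s) for s in sample]
print(cleaned)
print(suspicious_lines(cleaned))
```

Anything reported by `suspicious_lines` still has to be checked against the image and fixed manually.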

    Lastly, copy and paste each block of text into the corresponding column of your spreadsheet.
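With more than 1000 rows, pasting block by block gets tedious, so here is a hedged sketch of how the blocks could be stitched back into rows with Python. It assumes the cleaned text keeps the layout shown above (one block per column, blocks separated by a blank line, every block the same length). The names `blocks_to_rows` and `rows_to_csv` are mine, and the sample is just the first three lines of the first three blocks from the result above.

```python
import csv
import io
import re

def blocks_to_rows(text: str):
    """Treat each blank-line-separated block as a column and zip the columns into rows."""
    blocks = re.split(r"\n\s*\n", text.strip())
    columns = [[ln.strip() for ln in block.splitlines() if ln.strip()] for block in blocks]
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        # A missing or extra line in one block would shift every later row
        raise ValueError(f"blocks have different lengths: {sorted(lengths)}")
    return list(zip(*columns))

def rows_to_csv(rows) -> str:
    """Serialize rows as CSV; values containing commas are quoted automatically."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()

sample = """298
299
300

910,820,000
920,820,000
930,820,000

210,000
220,000
220,000
"""

rows = blocks_to_rows(sample)
print(rows_to_csv(rows))
```

The resulting CSV text can be saved to a file, opened in LibreOffice Calc, and saved as .ods from there. The length check is deliberate: if the OCR dropped a line in one block, it is better to stop than to silently misalign the columns.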