How to convert or extract a table from an image using Tesseract?

I have an image of a table (as in a pandas DataFrame or an Excel sheet).

I just started using Tesseract, and I'm having trouble converting the image into a table.

I'm using the following code.

import cv2

img_cv = cv2.imread(imagepath)  # OpenCV loads images in BGR channel order
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)

The words and letters are recognized, but the formatting is all off: the text comes out as one jumbled chunk.

'IN ETaat=) Count... Tkr & Exch Market Sales %ReventRelationshi Account %Cost Source As Of Date\n\nCap Surprise Value (Q) As Type\n\n21) Facebook Inc LUIS} las) LOS 516.19B) 0.93%\n\n39) Applied Optoelectro...|US AAOI US 177.83M 1.77% 10.90% 5.20M|\\CAPEX 0.14%|*2019A CF 02/28/2020\n40) Activision Blizzard ...|US ATVI US 46.13B 0.89%, 0.31%) 4.02M|COGS 0.13%|Estimate 12/03/2019\n41) Quanta Computer I... |TW 2382 an 7.93B| -2.73% 0.04% 3.02M/COGS 0.11%|Estimate 07/04/2019\n42) Modern Avenue Gro...|CN 002656 CH) 263.51M| -2.87%| 4.44% 2.60M|\\COGS 0.10%|*2018A CF 04/26/2019\n43) Mellanox Technolog...|IL MLNX US 6.51B| 13.57%| 0.74%) 2.80M|\\COGS (OM O}=1<1 tim [nate] k=) 03/03/2020\n44) O-Net Technologies...|CN 877 ale 463.33M aad 3.11%) 2.49M|CAPEX 0.07%|Estimate 10/30/2019\n45) Adobe Inc US ADBE US 162.75B 0.63%, 0.08% 2.02M|\\SG&A 0.07%|Estimate 06/12/2019\n46) British Land Co PLC...\\|GB BLND LN 5.74B| 10.97% 1.05% 2.12M\\SG&A (OM Oley atin [nat] k=) 11/19/2019\n47) Bel Fuse Inc US BELFA US | 123.22M) -3.66% 1.13% 1.40M/COGS (omer tl at-im [gate] k=) 11/19/2019\n48) Keysight Technolog...|US Nees US 17.99B 3.37%, 0.08% 880.90k/\\COGS (OM Oey a-imeat- 1K) 01/03/2020\n49) BT Group PLC GB BT/A LN 17.00B|} -0.01% 0.01% 631.65k/COGS (om OP2-1) at-1 8 [gate] K=) 01/16/2020\n50) KT Corp KR 030200 KS 5.21B 0.32%, 0.02% 1.07M|SG&A (om OP2-1) at-1 8 [gate] K=) 05/10/2019\n5D Sunny Optical Tech... |CN 2382 ale 18.16B aad 0.04% 425.69k/ COGS (om eM Rati m [nat] -) 08/27/2019\n52) Belden Inc US 131 D1@% US 1.95B 5.68%, 0.04%) 255.50k|COGS (om eM Rati m [nat] -) 11/04/2019\n53) Lattice Semiconduc... |US LSCC US 2.51B 0.24%, 0.18%) 174.54k COGS (om eM Rati m [nat] -) 05/08/2019\n54 Zhen Ding Technolo.../TW 4958 an 3.55B| -0.77%| 0.02%) 184.75k/COGS (om eM Rati m [nat] -) 01/17/2020\n55) Emnet Inc KR 123570 KS 66.79M aid Pa hei) 214.59k|SG&A *2019C3 CF 11/14/2019\n56) Zebra Technologies...|US ZBRA US 10.95B| -0.32% 57.18k\\COGS stim [eat] k=) 02/21/2020'

Is there a way to get the output into a proper table format?


  • The image is horizontally compressed, so enlarging it before running OCR mostly works; I increased the height by ~25% and the width by ~10%.

    # cv2.resize expects the target size as (width, height);
    # img_cv.shape is (height, width, channels).
    img_resized = cv2.resize(img_cv,
                             (int(img_cv.shape[1] * 1.10),   # width  +10%
                              int(img_cv.shape[0] * 1.25)))  # height +25%
    img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
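
The resize step above is easy to get wrong because `cv2.resize` takes its target size as (width, height) while `shape` is (height, width, channels). Here is a minimal sketch that keeps the arithmetic in one place. The helper names `scaled_size` and `resize_for_ocr` are made up for illustration, the default gains are just the ~10% and ~25% from this answer, and `INTER_CUBIC` is my own choice for upscaling rather than something the answer used.

```python
def scaled_size(shape, width_gain=0.10, height_gain=0.25):
    """Return (width, height) for cv2.resize from an image shape (rows, cols, ...)."""
    height, width = shape[:2]
    # cv2.resize wants (width, height), the reverse of the shape order.
    return round(width * (1 + width_gain)), round(height * (1 + height_gain))


def resize_for_ocr(image, width_gain=0.10, height_gain=0.25):
    """Enlarge an OpenCV image before OCR (hypothetical helper, needs opencv-python)."""
    import cv2  # imported lazily so scaled_size works without OpenCV installed

    target = scaled_size(image.shape, width_gain, height_gain)
    # INTER_CUBIC is an assumption here; other interpolation modes may suit other images.
    return cv2.resize(image, target, interpolation=cv2.INTER_CUBIC)
```

With this, `img_resized = resize_for_ocr(img_cv)` would stand in for the explicit call, and the gains can be tuned per image, since the best values probably depend on the source resolution.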


    In [42]: print(pytesseract.image_to_string(img_rgb))                                                
    vente) Count... Tkr & Exch Market Sales %ReventRelationshiAccount %Cost Source As Of Date
    Cap Surprise Value (Q) As Type
    21) Facebook Inc US FB US 516.19B) 0.93%
    39) Applied Optoelectro...|US AAOI US | 177.83M| 1.77%| 10.90% 5.20M|\CAPEX 0.14%|*2019A CF 02/28/2020
    40) Activision Blizzard ...|US ATVI US 46.13B) 0.89% 0.31% 4.02M|\COGS 0.13%|/Estimate 12/03/2019
    41) Quanta Computer I... |TW 2382 TT 7.93B| -2.73%| 0.04% 3.02M COGS 0.11%|/Estimate 07/04/2019
    42) Modern Avenue Gro... |CN 002656 CH! 263.51M -2.87%| 4.44% 2.60M|\COGS 0.10%|*2018A CF 04/26/2019
    43) Mellanox Technolog...|IL MLNX US 6.51B) 13.57%, 0.74% 2.80M|COGS 0.08%|/Estimate 03/03/2020
    44) O-Net Technologies...|CN 877 HK | 463.33M --| 3.11% 2.49M\CAPEX 0.07%|Estimate 10/30/2019
    45) Adobe Inc US ADBE US| 162.75B) 0.63%, 0.08% 2.02M SG&A 0.07%|Estimate 06/12/2019
    46) British Land Co PLC...|GB BLND- LN 5.74B) 10.97%, 1.05% 2.12M SG&A 0.06%|Estimate 11/19/2019
    47) Bel Fuse Inc US BELFA US | 123.22M -3.66%| 1.13% 1.40M|\COGS 0.04%|Estimate 11/19/2019
    48) Keysight Technolog...|US KEYS US 17.99B| 3.37% 0.08% 880.90k|COGS 0.03%|Estimate 01/03/2020
    49) BT Group PLC GB BT/A LN 17.00B| -0.01%| 0.01% 631.65k/COGS 0.02%|/Estimate 01/16/2020
    50) KT Corp aoe 030200 KS 5.21B) 0.32% 0.02% 1.07M|SG&A 0.02%|/Estimate 05/10/2019
    51) Sunny Optical Tech... |CN 2382 HK 18.16B --| 0.04% 425.69k/COGS 0.01%|/Estimate 08/27/2019
    52) Belden Inc US BDC US 1.95B) 5.68% 0.04% 255.50k/|COGS 0.01%|/Estimate 11/04/2019
    53) Lattice Semiconduc...|US Lscc US 2.51B) 0.24% 0.18% 174.54k|COGS 0.01%|/Estimate 05/08/2019
    54) Zhen Ding Technolo..., TW 4958 TT 3.55B) -0.77%| 0.02% 184.75k/COGS 0.01%|/Estimate 01/17/2020
    55) Emnet Inc KR 123570 KS| 66.79M --| 2.78% 214.59k/SG&A *2019C3 CF Wary esenke,
    56) Zebra Technologies...|US ZBRA US 10.95B) -0.32% 57.18k|COGS Estimate 02/21/2020
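
The output above is still one string per line, not a table. As a rough first step toward rows, the sketch below pulls out the two fields that are reliably delimited in this particular output: the leading `NN)` row number and the trailing `MM/DD/YYYY` date. It makes several assumptions. `ROW_RE`, `parse_row` and `ocr_text_to_rows` are hypothetical names, the pattern is fitted to this table only, and lines that lack a trailing date (such as the header, or a row where OCR garbled the date) are skipped rather than repaired.

```python
import re

# Assumed row layout for this table: "NN) <cells...> MM/DD/YYYY".
ROW_RE = re.compile(r"^(\d+)\)\s+(.*?)\s+(\d{2}/\d{2}/\d{4})$")


def parse_row(line):
    """Split one OCR line into (row number, middle text, date), or return None."""
    # Stray '|' and '\' characters are leftovers of the cell borders; drop them.
    cleaned = re.sub(r"[|\\]", " ", line)
    # Collapse runs of whitespace so the pattern sees single spaces.
    cleaned = " ".join(cleaned.split())
    match = ROW_RE.match(cleaned)
    if match is None:
        return None
    number, middle, date = match.groups()
    return int(number), middle, date


def ocr_text_to_rows(text):
    """Parse every line of the OCR text, keeping only lines that look like rows."""
    parsed = (parse_row(line) for line in text.splitlines())
    return [row for row in parsed if row is not None]
```

The middle text still has to be split into columns, which would need rules specific to this table, because the company names contain spaces. The resulting tuples can then be written out with the standard `csv` module.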

    To write this to an output file, do:

    output = pytesseract.image_to_string(img_rgb)
    with open('test.csv','w') as f: 