Format issue with the output file of pytesseract


I am trying to extract text from an image using pytesseract.

I want the output file to have the same format as the image being processed.

By format I mean that the output text should be arranged in the same rows and columns as in the input image.

I have tried the following code. The text recognition is mostly accurate, but the output file looks nothing like the input.

Code

import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng'
d = pytesseract.image_to_data(Image.open(r'_0.png'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks (with Output.DICT, conf is numeric rather than the string '-1')
df1 = df[(pd.to_numeric(df.conf, errors='coerce') != -1) & (df.text.str.strip() != '')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added 
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

Input Image

[image]

Output

[image]


Solution

  • First of all, remove the noise from the image; noise produces extra recognition errors.

    Next, try a different output format. For example, hOCR is an HTML/XML output that includes bounding-box information, so you can get the exact position of each OCR result in the image.

    If you do not need exact positions, it may be easier to post-process the plain-text output. For example, tesseract 5 with tessdata_best produces this output:

    $ tesseract YaVQ3.jpg - --psm 6 --dpi 300 -c preserve_interword_spaces=1
    
    2
    wf
    10020 Knut Bratli, Brandval          P.b. Chrysler       1936
    10033 Erland Berg, Gjes&sen         P.b. Dodge        1939
    10054 Edvart Sandmo, Gardvik         P.b. Opel          1937
    10057 Hjalmar Aanerud, Vinger        P.b. Opel           1932
    10075 Reidar Holth, Flisa                P.b. Volvo         . 1960
    10076 Einar Bredalen, Braskereidfoss   P.b. Dodge        1929
    10077 Reidar Holth, Flisa            P.b. Volkswagen    1961
    10089 Sor-Odal Bulldozerdrift, Skarnes Lb. White         1944   "
    10090 Arne Radford, Galterud            Lb. Ford            1939
    10093 Sverre Langbraten, Brandval       L.b. Citroén          1950
    10096 Karl Tuhus, Skotterud          P.b. Chrysler       1936
    10101 Gunnar Bie-Larsen, Kongsvinger P.b. Ford    :   ©1961
    10110 Martin Albertsen, Flisa           Pb. Opel       .   1960
    10111 Alf @degaard, Kongsvinger         P.b. Volkswagen      1958
    10112 Asbjern Elverhoi, Kongsvinger    Pb. Ford          1961
    10114 Olav Sunde jr., Skarnes       ¢    P.b. Plymouth       1937
    10116 John Erichsen, Skarnes          P.b. Ford          1960
    10118 Ole Hasleengen, Véler    \        Pb. Morris         1931
    10120 Harald Eggen, Vinger    \       P.b. Peugeot        1938
    10121 Ola N. Berg, Gjesisen             Pb. Ford            1960
    10125 Reldar Rapstad, Roverud            Pb. Ford             1954     Pp
    10129 Erling Johnsrud, Skarnes           Pb. Overland         1939
    10130 Reidar Vangen, Disend          P.b. Hudson        1947      v
    10133 Oddvar Lilleseth, Skarnes      V.b. Ford        1934
    10136 Hans K. Kolbjornsrud, Austmarka P.b. Volvo         1939
    10140 Rolv Snare, Kongsvinger         P.h. Mercedes Benz 1950
    10143 Olaf Storberget, Grue Finnskog L.b. Land Rover    1951
    10146 Helge Strand, Magnor            P.b. Hudson         1946
    10148 Arne Hagan, Brandval             Pb. Volkswagen’    1957
    10159 Brodbelfoss, E.verk, Vinger        P.b. Chevrolet        1939
    10160 Lauritz Hove, Sander           Pb. Ford          1959
    10161 Rolf Johnsen, Matrand           Lb. Ford         * 1937
    10168 Sten Sooth Knutsen, Skotterud    Pb. Volkswagen     1962
    10170 Odd Norli, Knapper               P.b. Buick           1938
    10175 Gustav Solvang, Kongsvinger     L.b. Chevrolet       1939         4
    10180 Trygve Wolden, Kongsvinger    Pb. Dodge        1920
    10182 Kongsv. Handelsgartneri, Kongsv. Stb. Opel            1957
    10186 Oddvar Berget, Namni             Lb. Fordson         1933
    10188 Sander Idrettslag, Sander     .    Buss Austin      +1951
    10185 Karl O. Halvorsen, Br.foss        L.b. Hanomag       1955
    NN                                    -
    3
          :                                 ll
    v                                -—
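The post-processing idea could be sketched roughly as follows. This is only a minimal example, not a general solution: the regular expression assumes the row layout seen in the sample output above (a 5-digit number, owner and place, a vehicle-type abbreviation such as `P.b.` or `Lb.`, the make, and a 4-digit year), and the names `ROW`, `parse_row` and `format_table` are hypothetical helpers made up for illustration. Lines that do not fit the pattern (such as the stray `2` and `wf`) are skipped.

```python
import re

# Assumed row layout, inferred from the sample output above:
# 5-digit number, owner and place, vehicle-type abbreviation, make, 4-digit year.
ROW = re.compile(
    r"^\s*(?P<num>\d{5})\s+"
    r"(?P<owner>.+?)\s+"
    r"(?P<kind>[A-Z]\.?[a-z]\.|Stb\.|Buss)\s+"
    r"(?P<make>[^\W\d_]+(?: [^\W\d_]+)*)"   # one or more words made of letters
    r"\D*(?P<year>\d{4})"                   # skip stray symbols before the year
)


def parse_row(line):
    """Return (num, owner, kind, make, year), or None if the line is not a table row."""
    m = ROW.match(line)
    if m is None:
        return None
    owner = re.sub(r"\s+\W+$", "", m["owner"])  # drop stray symbols after the place
    owner = " ".join(owner.split())             # collapse runs of spaces
    return (m["num"], owner, m["kind"], m["make"], m["year"])


def format_table(rows):
    """Left-align every column to the width of its longest cell."""
    if not rows:
        return ""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


if __name__ == "__main__":
    # A few lines copied from the tesseract output above.
    ocr_text = """\
2
wf
10020 Knut Bratli, Brandval          P.b. Chrysler       1936
10075 Reidar Holth, Flisa                P.b. Volvo         . 1960
10101 Gunnar Bie-Larsen, Kongsvinger P.b. Ford    :   ©1961
10140 Rolv Snare, Kongsvinger         P.h. Mercedes Benz 1950
"""
    parsed = (parse_row(line) for line in ocr_text.splitlines())
    print(format_table([row for row in parsed if row is not None]))
```

Because the pattern is tied to this particular table, misread abbreviations that it does not cover will cause rows to be skipped, so it is worth checking which lines return `None` before trusting the result.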