Search code examples
pythoncomputer-visionocrtesseractpython-tesseract

Preserving indentation with Tesseract OCR 4.x


I'm struggling with Tesseract OCR. I have a blood examination image, it has a table with indentation. Although tesseract recognizes the characters very well, its structure isn't preserved in the final output. For example, look the lines below "Emocromo con formula" (Eng. Translation: blood count with formula) that are indented. I want to preserve that indentation.

I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.

Any suggestions?

Update:

I tried Tesseract v5.0 and the result is the same.

Code:

Tesseract version is 4.0.0.20190314

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM, 
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# default_config = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# saving to a txt file

with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)

Result with comparison:

Result

GITHUB:

I have created a GitHub repository if you want to try it yourself.

Thanks for your help and your time


Solution

  • image_to_data() function provides much more information. For each word it will return it's bounding rectangle. You can use that.

    Tesseract segments the image automatically to blocks. Then you can sort block by their vertical position and for each block you can find mean character width (that depends on the block's recognized font). Then for each word in the block check if it is close to the previous one, if not add spaces accordingly. I'm using pandas to ease on calculations, but it's usage is not necessary. Don't forget that the result should be displayed using monospaced font.

    import pytesseract
    from pytesseract import Output
    from PIL import Image
    import pandas as pd
    
    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
    df = pd.DataFrame(d)
    
    # clean up blanks
    df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
    # sort blocks vertically
    sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
    for block in sorted_blocks:
        curr = df1[df1['block_num']==block]
        sel = curr[curr.text.str.len()>3]
        char_w = (sel.width/sel.text.str.len()).mean()
        prev_par, prev_line, prev_left = 0, 0, 0
        text = ''
        for ix, ln in curr.iterrows():
            # add new line when necessary
            if prev_par != ln['par_num']:
                text += '\n'
                prev_par = ln['par_num']
                prev_line = ln['line_num']
                prev_left = 0
            elif prev_line != ln['line_num']:
                text += '\n'
                prev_line = ln['line_num']
                prev_left = 0
    
            added = 0  # num of spaces that should be added
            if ln['left']/char_w > prev_left + 1:
                added = int((ln['left'])/char_w) - prev_left
                text += ' ' * added 
            text += ln['text'] + ' '
            prev_left += len(ln['text']) + added + 1
        text += '\n'
        print(text)
    

    This code will produce following output:

        ssseeess+ SERVIZIO SANITARIO REGIONALE                          Pagina 2 di3 
       seoeeeees EMILIA-RROMAGNA 
         ©2888   800 
         ©9868  6 006   :       pe   ‘  ‘        " 
         «ee @@e@ecee Azienda Unita Sanitaria Locale di Modena 
         Seat se  ces Amends Ospedaliero-Universitaria Policlinico di Modena 
             Dipartimento  interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica 
                                                      Direttore dr. T.Trenti 
                                               Ospedale Civile S.Agostino-Estense 
                                                 S.C. Medicina  di Laboratorio 
                                               S.S. Patologia  Clinica - Corelab 
                                Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015 
                                                  Responsabile dr.ssa M.Varani 
            Richiesta (CDA):   49/073914                                    Data di accettazione: 18/12/2018 
                                                                            Data di check-in:    18/12/2018 10:27:06 
                                                                            Referto del          18/12/2018 16:39:53 
                                                                            Provenienza:         D4-cp sassuolo 
    
                                                               Sig. 
                                                               Data di Nascita: 
                                                               Domicilio: 
              ANALISI                                              RISULTATO  __UNITA'DI MISURA VALORI DI RIFERIMENTO 
           Glucosio                                                     95     mg/dl            (70  - 110 ) 
           Creatinina                                                 1.03     mg/dl            ( 0.50 - 1.40 ) 
           eGFR  Filtrato glomerulare stimato                         >60      ml/min           Cut-off per rischio di  I.R. 
                 7                                                                              <60. Il calcolo é€ riferito 
           Equazione  CKD-EPI                                                                   ad una superfice corporea 
                                                                                                Standard  (1,73 mq)x In Caso 
                                                                                                di etnia afroamericana 
                                                                                                moltiplicare per  il fattore 
                                                                                                1,159. 
           Colesterolo                                                212   *  mg/dl            < 200 v.desiderabile 
           Trigliceridi                                                106     mg/dl            < 180 v.desiderabile 
           Bilirubina totale                                          0.60     mg/dl            ( 0.16 - 1.10 ) 
           Bilirubina diretta                                         0.10     mg/dl            ( 0.01 - 0.3 ) 
           GOT  - AST                                                   17     U/L              (1-37) 
           GPT  - ALT                                                   ay     U/L              (1-   40 ) 
           Gamma-GT                                                     15     U/L              (1-55) 
           Sodio                                                       142     mEq/L            ( 136 - 146 ) 
           Potassio                                                    4.3     mEq/L            (3.5  - 5.3) 
           Vitamina B12                                               342      pg/ml            ( 200 - 960 ) 
           TSH                                                        5.47  *  ulU/ml           (0.35  - 4.94 ) 
           FT4                                                         9.7     pg/ml            (7  = 15) 
           Urine chimico fisico morfologico 
              u-Colore                                     giallo paglierino 
              u-Peso specifico                                       1.012                      ( 1.010 - 1.027  ) 
              u-pH                                                     5.5                      (5.5  - 6.5) 
              u-Glucosio                                           assente     mg/dl            assente 
              u-Proteine                                           assente     mg/dl            (0  -10 ) 
              u-Emoglobina                                         assente     mg/dl            assente 
              u-Corpi chetonici                                    assente     mg/dl            assente 
              u-Bilirubina                                         assente     mg/dl            assente 
              u-Urobilinogeno                                         0.20     mg/dl            (0-   1.0 ) 
              sedimento                                    non significativo 
                                                                                              Il Laureato: 
                                                                                                         Dott. CRISTINA ROTA 
           Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante 
           Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513; 
           D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10. 
           Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it) 
           i! Laureato: Dr. CRISTINA ROTA 
           1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna