
Error on Tesseract lstmtraining --continue_from


My goal is to add a handwriting font to the Hebrew language model.

  • I succeeded in creating the .tif and .box files, and then the .tr file.

  • But creating the traineddata fails with this error:

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr

Notes:

  • I'm using the "best" version, not the "fast". ("/tessdata" contains the "best" variant as "heb.traineddata")
  • I'm using langdata_lstm
  • For proof-of-concept purposes, max_pages is only 2
  • I'm using Windows 11, with tesseract v5.4.0

Any help would be appreciated.

My script:

#
text2image 
--text="langdata_lstm/heb/heb.training_text" 
--outputbase="output/2" 
--font="Handwriting Regular" 
--D="output" 
--fonts_dir="fonts" 
--max_pages="2" 

#
tesseract "output/2.tif" "output/2" -l heb box.train.stderr 

#
lstmtraining 
--stop_training 
--continue_from="output/2.tr" 
--traineddata="tessdata/heb.traineddata" 
--model_output="output/2.traineddata" 

Output:

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr

Solution

  • Installation ::


    Directory structure ::

    /app               (tesseract-ocr-w64-setup-5.4.0.20240606.exe)
        /gui           (tesstrain-windows-gui-main.zip)
        /langdata_lstm (github)
        /tessdata      (exist)
        /tessdata_best (github)
    /heb_hw
        /data
        /gt
    

    Finally, in the GUI, click "Re-check requirements".
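Before starting the GUI it can help to confirm that the folders from the layout above are really in place. The following is only a small sketch: the folder names are copied from the tree above, while `EXPECTED_DIRS` and `missing_dirs` are hypothetical names made up for this example.

```python
from pathlib import Path

# Folder names copied from the directory structure above.
# The root is assumed to be the folder that contains both 'app' and 'heb_hw'.
EXPECTED_DIRS = [
    "app/gui",
    "app/langdata_lstm",
    "app/tessdata",
    "app/tessdata_best",
    "heb_hw/data",
    "heb_hw/gt",
]


def missing_dirs(root):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs(".")` from the root folder returns an empty list when everything is in place. Some folders (for example `heb_hw/gt`) may only appear after the later steps, so a non-empty result is not necessarily an error.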


    Explanation ::

    • We create .tif and .gt.txt files from both the new fonts and the original heb.
    • Then we create a joint checkpoint from all of them.
    • Then we generate a new traineddata.
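For orientation, the flow above roughly corresponds to the command-line fine-tuning procedure from the Tesseract training documentation. The sketch below only builds the command lines and does not run them. It is an assumption that the GUI does something similar internally; the function name, the paths, and the iteration count are hypothetical. Note that `--continue_from` takes an LSTM model or a checkpoint, not a `.tr` file.

```python
def fine_tune_commands(best_traineddata, list_file, out_dir, name, max_iterations=400):
    """Build the three command lines of a typical LSTM fine-tuning run.

    list_file is assumed to be a text file listing the .lstmf training files.
    Nothing is executed here; the commands are returned as argument lists.
    """
    start_model = f"{out_dir}/{name}.lstm"
    checkpoint_base = f"{out_dir}/{name}"
    return [
        # 1. Extract the LSTM model from the 'best' traineddata.
        ["combine_tessdata", "-e", best_traineddata, start_model],
        # 2. Continue training from that model; lstmtraining writes
        #    '<model_output>_checkpoint' files while it runs.
        [
            "lstmtraining",
            f"--continue_from={start_model}",
            f"--traineddata={best_traineddata}",
            f"--train_listfile={list_file}",
            f"--model_output={checkpoint_base}",
            f"--max_iterations={max_iterations}",
        ],
        # 3. Convert the checkpoint into a traineddata file.
        [
            "lstmtraining",
            "--stop_training",
            f"--continue_from={checkpoint_base}_checkpoint",
            f"--traineddata={best_traineddata}",
            f"--model_output={out_dir}/{name}.traineddata",
        ],
    ]
```

Each returned list can be passed to `subprocess.run` once the paths match your setup.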

    Steps ::

    1. Create the per-line files: .tif, .gt.txt, .box.

    note: it uses 'app/langdata_lstm'.
    note: I had to install the TTF fonts in Windows (the app can't just read them from a folder)

    python heb_hw/gt.py 
    

    2. Use the GUI to Start Training

    set 'tessData folder' to 'app\tessdata_best'
    note: the installed variant doesn't allow further training, so use the 'best' one
    set 'Input ground truth dir' to 'heb_hw\gt'
    set 'Output dir' to 'heb_hw/data'
    set 'New language model name' to 'heb_hw'
    set 'Language type' to 'RTL'
    note: in this step, it creates per-line files from the original heb. From those, plus the files from step 1, it creates a checkpoint file.


    3. Use the GUI to Generate Best + Fast trained-data

    note: when it finishes, allow copying the Fast model to app/tessdata (for testing)


    4. Test

    first, copy the traineddata from "heb_hw\data\heb_hw\traineddata_fast" to "app/tessdata"

    tesseract -l heb_hw_fast test/test.jpg "ocr (heb_hw)"
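The copy-and-test step can also be scripted. This is a minimal sketch, assuming the Fast model is a `*.traineddata` file inside the source folder named above; `install_models` and `ocr_command` are hypothetical helper names, and the second one only builds the command line without running it.

```python
import shutil
from pathlib import Path


def install_models(src_dir, tessdata_dir):
    """Copy every *.traineddata file from src_dir into tessdata_dir.

    Returns the language names (file names without the extension) that were copied.
    """
    src, dst = Path(src_dir), Path(tessdata_dir)
    copied = []
    for model in sorted(src.glob("*.traineddata")):
        shutil.copy2(model, dst / model.name)
        copied.append(model.stem)
    return copied


def ocr_command(lang, image, output_base):
    """Build the tesseract command line: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]
```

For example, `install_models("heb_hw/data/heb_hw/traineddata_fast", "app/tessdata")` followed by `subprocess.run(ocr_command("heb_hw_fast", "test/test.jpg", "ocr (heb_hw)"))` would mirror the manual step, provided the copied file is really named `heb_hw_fast.traineddata`.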
    

    Done!


    gt.py ::

    import os
    import pathlib
    import random
    import subprocess
    
    langdata = 'app/langdata_lstm'
    training_text_file = f'{langdata}/heb/heb.training_text'
    unicharset = f'{langdata}/heb.unicharset'
    output_directory = 'heb_hw/gt'
    count = 10000  # maximum number of text lines to render
    
    # Fonts to render each line with (they must be installed, see the note in step 1)
    fonts = [
        'Gveret Levin AlefAlefAlef Regular',
        'Anka CLM Bold Expanded',
        'Dana Yad AlefAlefAlef Condensed',
        'Gadi Almog AlefAlefAlef Regular',
        'Ktav Yad CLM Medium Italic',
    ]
    
    # Read the training text as UTF-8, skipping blank lines
    with open(training_text_file, 'r', encoding='utf-8') as input_file:
        lines = [line.strip() for line in input_file if line.strip()]
    
    # Create the output directory (and any missing parents)
    os.makedirs(output_directory, exist_ok=True)
    
    # Keep a random sample of at most `count` lines
    random.shuffle(lines)
    lines = lines[:count]
    
    file_name_stem = pathlib.Path(training_text_file).stem
    
    for line_count, line in enumerate(lines):
        for font_index, font in enumerate(fonts):
            file_name = f'{file_name_stem}_f{font_index}_{line_count}'
    
            # Write the ground-truth text for this line
            line_training_text = os.path.join(output_directory, f'{file_name}.gt.txt')
            with open(line_training_text, 'w', encoding='utf-8') as output_file:
                output_file.write(line)
    
            # Render the line with this font (writes the .tif and .box files)
            outputbase = f'{output_directory}/{file_name}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_training_text}',
                f'--outputbase={outputbase}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                f'--unicharset_file={unicharset}'
            ])
    
        # Progress: lines processed so far / total lines
        print(line_count + 1, '/', len(lines))
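Since gt.py does not check the exit status of text2image, a font that fails to render leaves a .gt.txt file without a matching image. A small check like the following can list such cases before training starts. This is a sketch only, and `lines_without_image` is a hypothetical helper name.

```python
from pathlib import Path


def lines_without_image(gt_dir):
    """Return the base names of *.gt.txt files that have no matching .tif file."""
    gt_dir = Path(gt_dir)
    suffix = ".gt.txt"
    missing = []
    for gt_file in sorted(gt_dir.glob(f"*{suffix}")):
        base = gt_file.name[: -len(suffix)]
        if not (gt_dir / f"{base}.tif").is_file():
            missing.append(base)
    return missing
```

Running `lines_without_image("heb_hw/gt")` after gt.py finishes should return an empty list; any names it does return point at lines that were not rendered.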