
Error on Tesseract lstmtraining --continue_from

My goal is to add a handwriting font to the Hebrew language model.

  • I succeeded in creating the .tif and .box files, and then the .tr file.

  • But creating the traineddata fails. I'm getting this error:

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr


  • I'm using the "best" variant, not "fast" ("/tessdata" contains the "best" variant as "heb.traineddata").
  • I'm using langdata_lstm.
  • For POC purposes, max-pages is only 2.
  • I'm using Windows 11 with Tesseract v5.4.0.

Any help would be appreciated.

My script:

--font="Handwriting Regular" 

tesseract "output/2.tif" "output/2" -l heb box.train.stderr 


Output:

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr


  • Installation ::

    Directory structure ::

    /app               (tesseract-ocr-w64-setup-
        /gui           (
        /langdata_lstm (github)
        /tessdata      (exist)
        /tessdata_best (github)

    Lastly, in the GUI, click "Re-check requirements".

    Explanation ::

    • We create .tif and .gt.txt files from both the new font and the original heb.
    • Then we create a joined checkpoint from all of them.
    • Then we generate a new traineddata.
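For readers who prefer the command line, the three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily what the GUI runs: the function name `build_finetune_commands`, the paths, and the iteration count are hypothetical. The point it illustrates is that `lstmtraining --continue_from` expects a network file (for example one extracted with `combine_tessdata -e`) or a checkpoint, not a `.tr` training file, which may be what the error in the question is about.

```python
from pathlib import Path


def build_finetune_commands(base_traineddata, work_dir, list_file, new_name,
                            max_iterations=400):
    """Return argument lists for a typical fine-tuning run (hypothetical helper).

    Nothing is executed here; each list can be passed to subprocess.run().
    """
    work = Path(work_dir)
    # e.g. "heb.traineddata" -> "<work_dir>/heb.lstm"
    base_lstm = (work / (Path(base_traineddata).stem + ".lstm")).as_posix()
    checkpoint_prefix = (work / new_name).as_posix()
    return [
        # 1. Extract the LSTM network from the "best" traineddata.
        ["combine_tessdata", "-e", str(base_traineddata), base_lstm],
        # 2. Continue training from that network (not from a .tr file).
        ["lstmtraining",
         "--continue_from", base_lstm,
         "--traineddata", str(base_traineddata),
         "--train_listfile", str(list_file),
         "--model_output", checkpoint_prefix,
         "--max_iterations", str(max_iterations)],
        # 3. Convert the resulting checkpoint into a new traineddata file.
        ["lstmtraining", "--stop_training",
         "--continue_from", f"{checkpoint_prefix}_checkpoint",
         "--traineddata", str(base_traineddata),
         "--model_output", (work / f"{new_name}.traineddata").as_posix()],
    ]
```

Calling `build_finetune_commands("app/tessdata_best/heb.traineddata", "heb_hw/data", "heb_hw/data/list.txt", "heb_hw")` returns three lists that can be inspected or run one after another; the list file name is made up for the example.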

    Steps ::

    1. Create the per-line files: .tif, .gt.txt, .box.

    note: it uses 'app/langdata_lstm'.
    note: I had to install the TTF fonts in Windows (the app can't just read them from a library).

    python heb_hw/ 

    2. Use the GUI to start training.

    • set 'tessData folder' to 'app\tessdata_best'
      note: the installed variant doesn't allow appending (use 'best')
    • set 'Input ground truth dir' to 'heb_hw\gt'
    • set 'Output dir' to 'heb_hw/data'
    • set 'New language model name' to 'heb_hw'
    • set 'Language type' to 'RTL'

    note: in this step, the GUI creates per-line files from the original heb. From those files and the files from step 1, it creates a checkpoint file.

    3. Use the GUI to generate the Best + Fast traineddata.

    note: at the end, allow copying Fast to app/tessdata (for testing).

    4. Test.

    First, copy the traineddata from "heb_hw\data\heb_hw\traineddata_fast" to "app/tessdata".

    tesseract -l heb_hw_fast test/test.jpg "ocr (heb_hw)"
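The copy-and-test step can also be scripted. The following is a minimal sketch, assuming the generated model is a single file named `heb_hw_fast.traineddata` (the exact file name inside `traineddata_fast` is not stated above, so treat it as hypothetical); the helper name `install_and_build_ocr_command` is made up for illustration.

```python
import shutil
from pathlib import Path


def install_and_build_ocr_command(model_file, tessdata_dir, image, output_base):
    """Copy a .traineddata file into tessdata_dir and build the tesseract call.

    Returns (path of the copied model, argument list for subprocess.run).
    """
    model_file = Path(model_file)
    target = Path(tessdata_dir) / model_file.name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(model_file, target)
    # The language code passed to -l is the file name without ".traineddata".
    lang = model_file.stem
    return target, ["tesseract", str(image), str(output_base), "-l", lang]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; passing the arguments as a list avoids quoting problems with an output name such as "ocr (heb_hw)".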

    DONE ! ::

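    import os
    import random
    import pathlib
    import subprocess

    langdata = 'app/langdata_lstm'
    training_text_file = f'{langdata}/heb/heb.training_text'
    unicharset = f'{langdata}/heb.unicharset'
    output_directory = 'heb_hw/gt'
    count = 10000
    lines = []
    fonts = ['Gveret Levin AlefAlefAlef Regular','Anka CLM Bold Expanded','Dana Yad AlefAlefAlef Condensed','Gadi Almog AlefAlefAlef Regular','Ktav Yad CLM Medium Italic']
    # Open the training text file with UTF-8 encoding
    with open(training_text_file, 'r', encoding='utf-8') as input_file:
        for line in input_file.readlines():
            # Loop body is missing in the post; assumed: collect each stripped line
            lines.append(line.strip())
    if not os.path.exists(output_directory):
        # Body is missing in the post; assumed: create the output directory
        os.makedirs(output_directory)
    # Assumed (random is imported but otherwise unused): pick lines in random order
    random.shuffle(lines)
    lines = lines[:count]
    line_count = 0
    for line in lines:
        file_name_stem = pathlib.Path(training_text_file).stem
        for font in range(len(fonts)):
            file_name = f'{file_name_stem}_f{font}_{line_count}'
            line_training_text = os.path.join(output_directory, f'{file_name}.gt.txt')
            with open(line_training_text, 'w', encoding='utf-8') as output_file:
                # Body is missing in the post; assumed: write the line as ground truth
                output_file.write(line)
            outputbase = f'{output_directory}/{file_name}'
            # The rendering call is missing in the post; assumed: text2image renders
            # the line in the current font and writes outputbase.tif and outputbase.box
            subprocess.run([
                'text2image',
                f'--font={fonts[font]}',
                f'--text={line_training_text}',
                f'--outputbase={outputbase}',
                '--max_pages=1',
                f'--unicharset_file={unicharset}',
            ])
        line_count += 1
        print(line_count, ' / ', count)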
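Before starting training, it can help to confirm that every line really has all three files from step 1 (.gt.txt, .tif, .box), since a missing font or an unrenderable line can leave gaps. Here is a minimal sketch of such a check; the function name `find_incomplete_lines` is hypothetical and the check only looks at file names, not at file contents.

```python
from pathlib import Path


def find_incomplete_lines(gt_dir, required=(".gt.txt", ".tif", ".box")):
    """Map each line's base name to the required suffixes that are missing."""
    gt_dir = Path(gt_dir)
    # Collect base names such as "heb_f0_12" from any file with a known suffix.
    bases = set()
    for path in gt_dir.iterdir():
        for suffix in required:
            if path.name.endswith(suffix):
                bases.add(path.name[: -len(suffix)])
    # For every base name, list the suffixes that have no file on disk.
    missing = {}
    for base in sorted(bases):
        absent = [s for s in required if not (gt_dir / f"{base}{s}").exists()]
        if absent:
            missing[base] = absent
    return missing
```

Running `find_incomplete_lines('heb_hw/gt')` returns an empty dict when every line is complete; otherwise the keys show which lines to regenerate or delete.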