Error on Tesseract lstmtraining --continue_from

My goal is to add a handwriting font to the Hebrew language model.

I succeeded in creating the .tif and .box files, and then the .tr file,
but not in creating the traineddata. I'm getting this error:

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr

Notes:

I'm using the "best" variant, not "fast" ("/tessdata" contains the "best" variant as "heb.traineddata").
I'm using langdata_lstm.
For POC purposes, max_pages is only 2.
I'm using Windows 11, with tesseract v5.4.0

Any help would be appreciated.

My Script :

#
text2image 
--text="langdata_lstm/heb/heb.training_text" 
--outputbase="output/2" 
--font="Handwriting Regular" 
--D="output" 
--fonts_dir="fonts" 
--max_pages="2" 

#
tesseract "output/2.tif" "output/2" -l heb box.train.stderr 

#
lstmtraining 
--stop_training 
--continue_from="output/2.tr" 
--traineddata="tessdata/heb.traineddata" 
--model_output="output/2.traineddata"

Output :

Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr

Solution

Installation ::

python-3.12.4-amd64.exe
tesseract-ocr-w64-setup-5.4.0.20240606.exe
tesstrain-windows-GUI-main.zip (https://codeload.github.com/buliasz/tesstrain-windows-gui/zip/refs/heads/main)
AutoHotkey_2.0.18_setup.exe (GUI's dependency) (https://www.autohotkey.com/download/ahk-v2.exe)

Directory structure ::

/app               (tesseract-ocr-w64-setup-5.4.0.20240606.exe)
    /gui           (tesstrain-windows-gui-main.zip)
    /langdata_lstm (github)
    /tessdata      (already exists)
    /tessdata_best (github)
/heb_hw
    /data
    /gt

Lastly, in the GUI, click "Re-check requirements".
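
Before opening the GUI, the layout above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the folder list is copied from the directory structure above, and the function name missing_dirs is my own, not part of the GUI or of Tesseract.

```python
from pathlib import Path

# Folder names copied from the directory structure above.
EXPECTED_DIRS = [
    "app/gui",
    "app/langdata_lstm",
    "app/tessdata",
    "app/tessdata_best",
    "heb_hw/data",
    "heb_hw/gt",
]


def missing_dirs(root="."):
    """Return the expected folders that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    # Run from the folder that contains both app and heb_hw.
    for d in missing_dirs():
        print("missing:", d)
```

Note that heb_hw/gt is also created by gt.py if it is missing, so it is fine for it to show up as missing before step 1.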

Explanation ::

We create .tif and .gt.txt files from both the new fonts and the original heb model.
Then we create a joint checkpoint from all of them.
Then we generate a new traineddata from that checkpoint.
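
For reference, the error in the question most likely comes from passing a .tr file to --continue_from, which expects an LSTM model or a training checkpoint. Below is a minimal sketch of the command-line fine-tuning flow described in the Tesseract training documentation. I have not checked what the GUI runs internally, so treat it as an illustration only. The work_dir layout, the base.lstm file name, the list file path and the iteration count are all assumptions. The function only builds the commands; it does not run them.

```python
def fine_tune_commands(best_traineddata, work_dir, model_name,
                       train_listfile, max_iterations=400):
    """Build (but do not run) the three commands of a fine-tuning flow.

    All paths are hypothetical; adjust them to your own layout.
    """
    base_lstm = f"{work_dir}/base.lstm"
    checkpoint_prefix = f"{work_dir}/{model_name}"
    return [
        # 1. Extract the LSTM model from the "best" traineddata.
        ["combine_tessdata", "-e", best_traineddata, base_lstm],
        # 2. Continue training from that model on the new .lstmf files
        #    named in train_listfile.
        ["lstmtraining",
         f"--continue_from={base_lstm}",
         f"--traineddata={best_traineddata}",
         f"--model_output={checkpoint_prefix}",
         f"--train_listfile={train_listfile}",
         f"--max_iterations={max_iterations}"],
        # 3. Convert the resulting checkpoint into a traineddata file.
        ["lstmtraining", "--stop_training",
         f"--continue_from={checkpoint_prefix}_checkpoint",
         f"--traineddata={best_traineddata}",
         f"--model_output={work_dir}/{model_name}.traineddata"],
    ]


if __name__ == "__main__":
    commands = fine_tune_commands(
        "app/tessdata_best/heb.traineddata",  # path from the layout above
        "heb_hw/data",                        # assumed working folder
        "heb_hw",
        "heb_hw/data/list.train",             # assumed list of .lstmf files
    )
    for command in commands:
        print(" ".join(command))
```

The key point is that step 3 continues from a checkpoint written by step 2, not from a .tr file.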

Steps ::

create per-line files: .tif, .gt.txt, .box.

note: it uses 'app/langdata_lstm'.
note: I had to install the .ttf fonts in Windows (the app can't just read them from a folder)

python heb_hw/gt.py
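
If text2image complains about a font, it may help to check which fonts it can actually see. A minimal sketch, assuming text2image is on PATH and using its --list_available_fonts option. I don't rely on the exact format of the listing, so the check is a plain substring match (which means a short name can also match a longer one). The helper names and the C:/Windows/Fonts path are my own assumptions.

```python
import subprocess


def list_fonts_text(fonts_dir):
    """Return whatever text2image prints when asked to list fonts."""
    result = subprocess.run(
        ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams to be safe
        text=True,
        errors="replace",
    )
    return result.stdout


def fonts_not_listed(listing_text, wanted):
    """Return the wanted font names that do not appear in the listing."""
    return [name for name in wanted if name not in listing_text]


# Example usage (assumed fonts folder):
# listing = list_fonts_text("C:/Windows/Fonts")
# print(fonts_not_listed(listing, ["Anka CLM Bold Expanded"]))
```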

use the GUI to Start Training

set 'tessData folder' to 'app\tessdata_best'
note: the installed variant doesn't allow appending; the 'best' variant is needed
set 'Input ground truth dir' to 'heb_hw\gt'
set 'Output dir' to 'heb_hw/data'
set 'New language model name' to 'heb_hw'
set 'Language type' to 'RTL'
note: in this step, it creates per-line files from the original heb model. With them, and the files from step 1, it creates a checkpoint file.

use the GUI to Generate Best + Fast trained-data

note: at the end, allow copying Fast to app/tessdata (for testing)

test

first, copy the traineddata from "heb_hw\data\heb_hw\traineddata_fast" to "app/tessdata"

tesseract -l heb_hw_fast test/test.jpg "ocr (heb_hw)"
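
The copy and test steps can also be scripted. A minimal sketch, assuming "traineddata_fast" is a folder that holds one or more .traineddata files. The helper names are my own, and the command is only built here, not executed.

```python
import shutil
from pathlib import Path


def install_models(src_dir, tessdata_dir):
    """Copy every .traineddata file from src_dir into tessdata_dir.

    Returns the language names (file stems) that were copied.
    """
    copied = []
    for model in sorted(Path(src_dir).glob("*.traineddata")):
        shutil.copy2(model, Path(tessdata_dir) / model.name)
        copied.append(model.stem)
    return copied


def ocr_command(image, output_base, lang):
    """Build the same tesseract call as in the test step above."""
    return ["tesseract", "-l", lang, str(image), str(output_base)]


# Example usage (paths from the steps above):
# langs = install_models("heb_hw/data/heb_hw/traineddata_fast", "app/tessdata")
# print(ocr_command("test/test.jpg", "ocr (heb_hw)", langs[0]))
```

Passing the command as a list to subprocess.run avoids quoting problems with the space in the output name.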

DONE !

gt.py ::

import os
import random
import pathlib
import subprocess

langdata = 'app/langdata_lstm'
training_text_file = f'{langdata}/heb/heb.training_text'
unicharset = f'{langdata}/heb.unicharset'
output_directory = 'heb_hw/gt'
count = 10000
lines = []

fonts = ['Gveret Levin AlefAlefAlef Regular','Anka CLM Bold Expanded','Dana Yad AlefAlefAlef Condensed','Gadi Almog AlefAlefAlef Regular','Ktav Yad CLM Medium Italic']


# Read the training text (UTF-8), keeping one stripped line per entry
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file:
        lines.append(line.strip())

# Create the output directory, including any missing parent folders
os.makedirs(output_directory, exist_ok=True)

random.shuffle(lines)

lines = lines[:count]

# The stem is the same for every line, so compute it once
file_name_stem = pathlib.Path(training_text_file).stem

for line_count, line in enumerate(lines):
    # Render the same line once per font
    for font_index, font in enumerate(fonts):
        file_name = f'{file_name_stem}_f{font_index}_{line_count}'

        # Write the ground-truth text for this line
        line_training_text = os.path.join(output_directory, f'{file_name}.gt.txt')
        with open(line_training_text, 'w', encoding='utf-8') as output_file:
            output_file.write(line)

        outputbase = f'{output_directory}/{file_name}'

        # Render the line to an image next to its .gt.txt file
        subprocess.run([
            'text2image',
            f'--font={font}',
            f'--text={line_training_text}',
            f'--outputbase={outputbase}',
            '--max_pages=1',
            '--strip_unrenderable_words',
            f'--unicharset_file={unicharset}'
        ])

    # Progress: lines done so far / total lines selected
    print(line_count + 1, '/', len(lines))
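
After gt.py finishes, it is worth checking that every .gt.txt file got its image, since text2image can fail for a single font or line without stopping the script. A minimal sketch, assuming the file naming used by gt.py above. The function name is my own.

```python
from pathlib import Path


def lines_missing_images(gt_dir):
    """Return the base names that have a .gt.txt file but no .tif file."""
    gt_dir = Path(gt_dir)
    missing = []
    for gt_file in sorted(gt_dir.glob("*.gt.txt")):
        base = gt_file.name[:-len(".gt.txt")]
        if not (gt_dir / f"{base}.tif").exists():
            missing.append(base)
    return missing


# Example usage:
# print(lines_missing_images("heb_hw/gt"))
```

Any names it returns can be deleted or regenerated before starting the training step.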