tesseract hebrew tesstrain

Tesseract training new font for Hebrew


I found this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, and tried to follow its instructions. I git cloned both tesseract and tesstrain.

I added heb.training_text from https://github.com/HayekZH/LangData_Tesseract/tree/master/heb, created the folders, and ran the Python script, which worked. But the training command

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

doesn't even seem to be supported. I need to train this Rashi font: https://github.com/googlefonts/mekorot
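Two details in the command as pasted may be worth checking. The shell splits `START_MODEL= eng` into an empty `START_MODEL=` and a separate word `eng`, which make then treats as a target name. Also, tesstrain by default looks for ground truth in data/MODEL_NAME-ground-truth, so `MODEL_NAME=Apex` would not pick up a `Rashi-ground-truth` folder. Below is a minimal sketch of assembling the same call from Python, assuming the model is named `Rashi` to match that folder and that `eng` is kept as the start model. The helper names `build_training_command` and `run_training` are illustrative and not part of tesstrain.

```python
import os
import subprocess


def build_training_command(model_name, start_model, tessdata, max_iterations):
    """Return the argument list for tesstrain's `make training` call."""
    return [
        'make', 'training',
        f'MODEL_NAME={model_name}',          # assumed to match data/<MODEL_NAME>-ground-truth
        f'START_MODEL={start_model}',        # no space after '='
        f'TESSDATA={tessdata}',
        f'MAX_ITERATIONS={max_iterations}',
    ]


def run_training(command, tessdata, tesstrain_dir='tesstrain'):
    """Run the command inside the tesstrain checkout (hypothetical helper)."""
    # Passing TESSDATA_PREFIX through the environment avoids the
    # `VAR=value command` prefix syntax, which not every shell accepts.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    subprocess.run(command, cwd=tesstrain_dir, env=env, check=True)


command = build_training_command('Rashi', 'eng', '../tesseract/tessdata', 10000)
print(' '.join(command))
```

Calling `run_training(command, '../tesseract/tessdata')` would then start the training, provided make and the tesstrain prerequisites are installed.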

Edit:

This is what the video used. The training command:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

The Python script:

import os
import random
import pathlib
import subprocess

training_text_file = 'langdata/heb.training_text'

lines = []

# Open the training text file with UTF-8 encoding
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file:
        line = line.strip()
        if line:  # skip blank lines so no empty ground-truth files are written
            lines.append(line)

output_directory = 'tesstrain/data/Rashi-ground-truth'

# Create the ground-truth directory, including missing parents, if needed
os.makedirs(output_directory, exist_ok=True)

random.shuffle(lines)

count = 81

lines = lines[:count]

line_count = 0
for line in lines:
    training_text_file_name = pathlib.Path(training_text_file).stem
    line_training_text = os.path.join(output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
    with open(line_training_text, 'w', encoding='utf-8') as output_file:
        output_file.writelines([line])

    # Use the same base name as the .gt.txt file so the image and text pair up
    file_base_name = f'{training_text_file_name}_{line_count}'

    subprocess.run([
        'text2image',
        '--font=Mekorot-Rashi Medium',  # must match a name shown by text2image --list_available_fonts
        f'--text={line_training_text}',
        f'--outputbase={output_directory}/{file_base_name}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=langdata/heb.unicharset'
    ], check=True)  # stop immediately if text2image fails

    line_count += 1
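
Before starting the training it can help to confirm that every ground-truth text file actually got an image, since the script above writes one `.gt.txt` and one `.tif` per line. The following is a minimal sketch of such a check; the function name `find_unpaired_ground_truth` is hypothetical and it assumes the file naming used by the script.

```python
import pathlib


def find_unpaired_ground_truth(directory):
    """Return base names of .gt.txt files that have no matching .tif image."""
    directory = pathlib.Path(directory)
    suffix = '.gt.txt'
    missing = []
    for gt_file in sorted(directory.glob(f'*{suffix}')):
        base_name = gt_file.name[:-len(suffix)]
        if not (directory / f'{base_name}.tif').exists():
            missing.append(base_name)
    return missing
```

For example, `find_unpaired_ground_truth('tesstrain/data/Rashi-ground-truth')` returns an empty list when every text line has an image, and otherwise lists the base names for which text2image produced no `.tif`.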

Solution

  • Use https://github.com/buliasz/tesstrain-windows-gui after generating the .tif files and the rest of the ground truth with the Python script from this video: https://www.youtube.com/watch?v=KE4xEzFGSU8. The GUI is very nice, but you also need to install AutoHotkey to run it. Its author deserves a coffee.