ocr tesseract

Tesseract training for a new font


I'm still new to Tesseract OCR, and after using it in my script I noticed a relatively high error rate on the images I was trying to extract text from. I came across Tesseract training, which is supposed to reduce the error rate for a specific font. I also found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. I received a .traineddata file from it, but I'm not sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.


Solution

  • This might be a late response, but the question still shows up on Google.

    Newer versions of Tesseract ship with a set of tools that make this much easier, without any manual work in a box editor.

    text2image lets you generate both the .tif file and its respective .box file for use with tesstrain.

    text2image \
        --font="Font Name" \
        --fonts_dir="Optional Fonts Dir" \
        --text=path/to/textfile \
        --outputbase=path/to/output \
        --max_pages=1 \
        --leading=32 \
        --xsize=3600 \
        --ysize=480 \
        --char_spacing=1.0 \
        --exposure=0 \
        --unicharset_file=path/to/unicharset
    

    I believe the --unicharset_file parameter may be optional.
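
If you want to leave --unicharset_file out when you don't have one, a small wrapper can help. This is only a rough sketch: text2image_args is a made-up helper name, not something that ships with Tesseract, and the fixed values are just copied from the command above. It only prints the arguments, one per line, so you can check them before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): prints a text2image
# command line, one argument per line.
# Usage: text2image_args FONT TEXT_FILE OUTPUT_BASE [UNICHARSET_FILE]
text2image_args() {
    font=$1
    text=$2
    outputbase=$3
    unicharset=${4:-}

    # Fixed values are copied from the example command in this answer.
    set -- text2image \
        "--font=$font" \
        "--text=$text" \
        "--outputbase=$outputbase" \
        --max_pages=1 \
        --leading=32 \
        --xsize=3600 \
        --ysize=480 \
        --char_spacing=1.0 \
        --exposure=0

    # Append the unicharset flag only when a path was supplied.
    if [ -n "$unicharset" ]; then
        set -- "$@" "--unicharset_file=$unicharset"
    fi

    printf '%s\n' "$@"
}

# Example call with placeholder values; the font name contains a space.
text2image_args "Font Name" path/to/textfile path/to/output
```

Once the printed arguments look right, replacing the final printf line inside the function with a plain "$@" would run the command instead of printing it. The quoting keeps a font name with spaces together as a single argument.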