
How do I train Tesseract on my own dataset?


I have a dataset of about 1000 gt.txt and TIFF files. I tried the tesstrain project and ran the following command:

    make training MODEL_NAME=cmc7 TESSDATA=path/to/tessdata_best

The command ran successfully, but when I try to use the resulting traineddata file, it doesn't work as expected. What is the right way to train Tesseract on my dataset? Thank you.


Solution

  • To train on my dataset of images, I use two types of files in addition to the images:

    • the gt.txt files with the expected output
    • box files generated out of the images with the changes I want to train the model with

    I place all three kinds of files (images, gt.txt files, and box files) inside tesstrain/data/my-model-ground-truth and run the following command from the tesstrain folder:

    make training MODEL_NAME=my-model START_MODEL=eng TESSDATA=../tessdata_best
    

    This assumes you want to train on top of eng.traineddata from the tessdata_best repository: https://github.com/tesseract-ocr/tessdata_best

    This generates my-model.traineddata inside the tesstrain/data folder.
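
Before running make training, it can help to confirm that every image in the ground-truth folder has a non-empty transcription next to it, since a missing or empty gt.txt is an easy mistake to make with about 1000 files. The following is only a minimal sketch: the function name check_ground_truth is hypothetical, it assumes images are named <name>.tif or <name>.tiff with transcriptions named <name>.gt.txt, and it does not check the box files.

```shell
#!/usr/bin/env bash
# Sketch: report ground-truth images that lack a usable .gt.txt transcription.
# Assumption: images are <name>.tif or <name>.tiff, transcriptions are <name>.gt.txt.

# check_ground_truth DIR (hypothetical helper, not part of tesstrain)
# Prints one line per image in DIR whose .gt.txt is missing or empty.
# Returns 0 if every image has a transcription, 1 otherwise.
check_ground_truth() {
    local dir="$1" missing=0 img base
    for img in "$dir"/*.tif "$dir"/*.tiff; do
        [ -e "$img" ] || continue          # skip glob patterns that matched nothing
        base="${img%.*}"                   # strip the .tif / .tiff extension
        if [ ! -s "$base.gt.txt" ]; then   # -s: file exists and is not empty
            echo "missing or empty: $base.gt.txt"
            missing=1
        fi
    done
    return "$missing"
}
```

Run from the tesstrain folder, this could gate the training step, for example: check_ground_truth data/my-model-ground-truth && make training MODEL_NAME=my-model START_MODEL=eng TESSDATA=../tessdata_best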