I have a dataset of about 1000 gt.txt and tiff files. I tried to use the tesstrain project and ran the following command:
make training MODEL_NAME=cmc7 TESSDATA=path/to/tessdata_best
The command ran successfully, but when I try to use the resulting traineddata it doesn't work as expected. My question is: what is the right way to train Tesseract on my dataset? Thank you.
To train on my dataset of images, I use 2 types of files in addition to the images:
I place all 3 files inside tesstrain/data/my-model-ground-truth and run the following command from the tesstrain folder:
make training MODEL_NAME=my-model START_MODEL=eng TESSDATA=../tessdata_best
This assumes you want to train on top of eng.traineddata from the tessdata_best repository: https://github.com/tesseract-ocr/tessdata_best
The command generates my-model.traineddata inside the tesstrain/data folder.
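Before running make training, it can help to confirm that every image in the ground-truth folder has a matching transcription, since unpaired files are an easy thing to miss with around 1000 files. Below is a minimal Python sketch of such a check. It assumes the naming convention described in the tesstrain README as I understand it (an image name.tif or name.png paired with name.gt.txt); the function name find_unpaired and the extension list are my own choices for illustration, not part of tesstrain.

```python
from pathlib import Path

# Assumption: tesstrain pairs "name.tif" / "name.png" with "name.gt.txt".
# Adjust these if your version of tesstrain expects something else.
IMAGE_SUFFIXES = (".tif", ".png")
GT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (image names without a .gt.txt, .gt.txt names without an image)."""
    image_names = set()
    gt_names = set()
    for path in Path(ground_truth_dir).iterdir():
        name = path.name
        if name.endswith(GT_SUFFIX):
            # Strip the full ".gt.txt" suffix to get the base name.
            gt_names.add(name[: -len(GT_SUFFIX)])
        elif name.endswith(IMAGE_SUFFIXES):
            # Strip only the final image extension.
            image_names.add(Path(name).stem)
    missing_gt = sorted(image_names - gt_names)
    missing_image = sorted(gt_names - image_names)
    return missing_gt, missing_image


# Example usage (hypothetical path, run from the tesstrain folder):
# missing_gt, missing_image = find_unpaired("data/my-model-ground-truth")
# print("images without transcription:", missing_gt)
# print("transcriptions without image:", missing_image)
```

Note that with this extension list, files ending in .tiff are not counted as images at all, so it may be worth checking which extension your images actually use.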