I'm trying to create training data for Tesseract 4.0 to identify icons (like, comment, share, save) in screenshots. This is a sample screenshot:
I would like to fine-tune Tesseract to produce the output below:
Like 147
Comment 29
Saved 5
Actions
58
Actions
Profile Visits 24
Follows 2
I followed the steps described in https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
I modified the box file so that each icon is labelled with a word, as below:
- Heart: Like
- Speech bubble: Comment
- Bookmark: Saved
- Arrow: Share
However, the final training data failed to read the icons the way I wanted. One example of the errors I got is 'Like is not in unicharset'. Do I have to do something different when creating the unicharset for icons?
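One way to investigate an error like this is to check the box file for labels that are longer than a single character, since each box line is expected to carry one symbol. The sketch below assumes the usual `<symbol> <left> <bottom> <right> <top> <page>` line layout; the function name `find_multichar_labels` and the sample coordinates are made up for illustration and are not part of Tesseract.

```python
def find_multichar_labels(box_text):
    """Return (line_number, label) pairs whose label is longer than one character.

    Assumes each box line looks like: <symbol> <left> <bottom> <right> <top> <page>.
    Note: len() counts Unicode code points, so a symbol built from several
    code points (e.g. an emoji with a variation selector) is also flagged.
    """
    problems = []
    for lineno, line in enumerate(box_text.splitlines(), start=1):
        parts = line.split()
        if len(parts) != 6:
            # Skip blank lines and lines that do not match the assumed layout.
            continue
        label = parts[0]
        if len(label) > 1:
            problems.append((lineno, label))
    return problems


# Hypothetical box-file content: a word label, three digits, and a single symbol.
sample = """\
Like 10 20 40 50 0
1 45 20 55 50 0
4 56 20 66 50 0
7 67 20 77 50 0
\u2665 10 80 40 110 0
"""

print(find_multichar_labels(sample))  # [(1, 'Like')]
```

Any label reported here is a candidate for the 'not in unicharset' error; a single-character label such as `♥` passes the check.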
I've figured it out. The box editor expects a single letter or number per box, not a full word, so I used a Unicode character to represent each icon. The steps are as below: