Train Tesseract to label icons

I'm trying to create training data for Tesseract 4.0 so that it can identify icons (like, comment, share, save) in screenshots. This is a sample screenshot:
[image: sample screenshot]

I would like to fine-tune Tesseract to produce output like this:
Like 147
Comment 29
Saved 5
Profile Visits 24
Follows 2

I have followed step-by-step as stated in

I modified the box file so that each icon is labelled with a word:
- Heart: Like
- Speech bubble: Comment
- Bookmark: Saved
- Arrow: Share

However, the final trained data failed to read the icons the way I wanted. One error I got was 'Like is not in unicharset'. Do I have to do something different when creating the unicharset for icons?


  • I've figured it out. The box editor expects a single character (letter or number) per box instead of a full word, which is why 'Like' was rejected. I used one Unicode character to represent each icon. The steps are as follows:

    1. Crop all the target icons that you want Tesseract to detect and save them in a single image file, named (in my case) own.std.exp0.png.
    2. Create the box file with the command 'tesseract own.std.exp0.png own.std.exp0 makebox'.
    3. Open the box file in jTessBoxEditor and enter a Unicode character in the Char column for each icon. The available characters can be browsed in the Character Map program (for example, I used U+2665 for the heart symbol). Note that some Unicode characters are not supported and show up as a blank square, so keep trying until you find one that works. My final edited box file looks like this:
      [image: edited box file]
    4. Create the final training file, which will be own.traineddata (this can be done as shown here, or by training with jTessBoxEditor).
    5. Copy own.traineddata to the Tesseract/tessdata directory and run Tesseract with lang='own+eng'. I used pytesseract, and the output is shown below:
      [image: tesseract output]
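
For anyone who wants to script the steps above, here is a minimal sketch in Python. It is not the exact code I ran. It assumes the usual box-file line layout (symbol, left, bottom, right, top, page) and treats any symbol longer than one character as suspect, which is only a heuristic. Only U+2665 for the heart comes from the steps above; the other three placeholder symbols, and the helper names multichar_box_entries, label_icons and read_screenshot, are made up for illustration. The OCR call uses pytesseract.image_to_string with the lang='own+eng' setting from step 5.

```python
# Placeholder symbol -> label wanted in the final output.
# Only U+2665 (heart) comes from the answer; the rest are assumed examples.
ICON_LABELS = {
    "\u2665": "Like",     # heart, as used in the answer
    "\u25cb": "Comment",  # assumed placeholder for the speech bubble
    "\u25a0": "Saved",    # assumed placeholder for the bookmark
    "\u25ba": "Share",    # assumed placeholder for the arrow
}


def multichar_box_entries(box_text):
    """Return box-file symbols longer than one character (e.g. 'Like').

    Assumes each box line is: symbol left bottom right top page.
    """
    bad = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) == 6 and len(parts[0]) > 1:
            bad.append(parts[0])
    return bad


def label_icons(text):
    """Replace each placeholder symbol in OCR output with its word label."""
    for symbol, label in ICON_LABELS.items():
        text = text.replace(symbol, label)
    return text


def read_screenshot(path, lang="own+eng"):
    """OCR a screenshot with the custom + English models, then relabel icons."""
    # Imported here so the helpers above work without Tesseract installed.
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_string(Image.open(path), lang=lang)
    return label_icons(raw)
```

Running multichar_box_entries on the text of the box file before training should flag word labels such as 'Like', which is the mistake from the question. After training, read_screenshot("screenshot.png") (a hypothetical file name) returns the OCR text with the symbols swapped for words. If Tesseract puts no space between a symbol and the number after it, the label and number will run together and need extra handling.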