
Tesseract TessData fonts used for training


I am using Tesseract for OCR in an Android app. My focus is Chinese, but I only need to recognise a few keywords, so I was thinking of creating my own .traineddata files with jTessBoxEditor. Which fonts does the Chinese Traditional tessdata file use? https://github.com/tesseract-ocr/tessdata

Alternatively, is there a way to edit the chi_tra.traineddata file so that it only recognises a few keywords? My main reason for doing this is that the file is 63.4 MB and Tesseract takes around 2 to 3 minutes to finish. The accuracy is great, but it is slow.


Solution

  • The font_properties file covering all of Tesseract's trained languages is available on GitHub. You can look through that list for the fonts specific to Traditional Chinese.

    In the tesseract-ocr/langdata repository on GitHub, the chi_tra.wordlist file inside the chi_tra folder lists the words used for training.
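
    If you download chi_tra.wordlist, a small script can tell you which of your keywords already appear in it and which characters your keywords need, which may help when preparing your own training data. The following is only a rough sketch: the function name keyword_coverage and the sample words are made up for illustration, and it assumes the wordlist has one entry per line.

```python
from typing import Iterable


def keyword_coverage(keywords: Iterable[str], wordlist_lines: Iterable[str]) -> dict:
    """Report which keywords appear in a wordlist and which characters they use.

    Hypothetical helper, not part of Tesseract. Assumes one wordlist entry
    per line; blank lines and surrounding whitespace are ignored.
    """
    known = {line.strip() for line in wordlist_lines if line.strip()}
    keywords = list(keywords)
    return {
        "found": [k for k in keywords if k in known],
        "missing": [k for k in keywords if k not in known],
        # Unique characters across all keywords, in first-seen order.
        "charset": "".join(dict.fromkeys("".join(keywords))),
    }


# In-memory sample standing in for the lines of a downloaded wordlist file.
sample_wordlist = ["香港", "台北", "中文"]
result = keyword_coverage(["香港", "出口"], sample_wordlist)
print(result)
```

    To run it against the real file, pass an open file object instead of the sample list, for example open("chi_tra.wordlist", encoding="utf-8"), where the path is wherever you saved your copy.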