
Data needed to train Tesseract OCR for a custom language


I am trying to build a custom language that detects only the following characters:

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '<', '<<<', '/']

I have almost 50 images for which I have generated box files and corrected the errors. My question: when training Tesseract on the custom characters above, do I also need to include images that were created by the Tesseract tool itself as input when building cust.traineddata?

I have written code that takes 5 characters from the array above, builds an image using the Tesseract tool, and then generates the .box file. It does this for all possible configurations, and the resulting box files are correct and need no tuning. But since Tesseract itself created these images, do they still need to be supplied when building cust.traineddata?

Thanks in advance.


Solution

  • We don't need to create a new language if we want Tesseract to use the default "eng" language to recognize the following characters: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '<', '<<<', '/']

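    You just need to pass the following configuration to Tesseract: tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789</". The whitelist is a set of single characters, so "<" also covers "<<<", and "/" must be listed for it to be recognized.

    e.g.

    tesseract input_image output_text -l eng -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789</"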
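
    If the character list changes often, the whitelist can be derived from it instead of typed by hand. The following is a minimal Python sketch, not part of the original answer: the helper names build_whitelist and build_command are hypothetical, and the file names are the same placeholders used in the command above. It builds the whitelist from the character list in the question (so it includes "/" and collapses "<<<" into "<") and prints the resulting command line without running it.

```python
import shlex
import string

# Character list from the question. "<<<" is a run of "<", not a separate symbol.
CHARSET = list(string.ascii_uppercase) + list(string.digits) + ["<", "<<<", "/"]


def build_whitelist(symbols):
    """Collapse symbols into unique single characters, keeping first-seen order."""
    seen = []
    for symbol in symbols:
        for ch in symbol:
            if ch not in seen:
                seen.append(ch)
    return "".join(seen)


def build_command(image_path, output_base, whitelist, lang="eng"):
    """Return the tesseract invocation as an argument list (hypothetical helper)."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


if __name__ == "__main__":
    whitelist = build_whitelist(CHARSET)
    cmd = build_command("input_image", "output_text", whitelist)
    # Print a shell-safe version of the command; to execute it instead,
    # pass the list to subprocess.run(cmd, check=True) with tesseract installed.
    print(shlex.join(cmd))
```

    Passing the arguments as a list avoids shell quoting problems with "<", which the shell would otherwise treat as a redirection when left unquoted.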