python, image-processing, tesseract, python-tesseract

Turning off English dictionary words for pytesseract (for an ALPR system)


I am using pytesseract to convert an image of a number plate to text, for something like this:

[image: number plate]

import sys

import pyocr

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)

# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))

This is how I read it. I whitelist all the characters that it could contain:

text = pytesseract.image_to_string(Image.open('images/text.jpg'), config="-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")

Right now pytesseract reads the plate as if it were looking for a dictionary word, and this gives less than desirable results. There is a way to turn off dictionary words, but I cannot figure out how to do it in Python. That is my question. Thanks.


Solution

  • Add a config file that disables the system and frequent-word DAWGs (Tesseract's dictionaries)

    load_system_dawg     F
    load_freq_dawg       F
    

    Config files should be placed in the tessdata/configs directory (e.g. tessdata/configs/config) and passed to Tesseract during the Init procedure.
    I am not 100% sure how this is done with pytesseract, but I believe you can work it out from here.

    The Init() function signature is something like this:

    const char *    datapath,
    const char *    language,
    OcrEngineMode   oem,
    char **     configs,
    int     configs_size,
    const GenericVector< STRING > *     vars_vec,
    const GenericVector< STRING > *     vars_values,
    bool    set_only_non_debug_params
    

    So you need to set configs to a pointer to a pointer to the string "config", and configs_size to 1.

    So it is probably something like this (untested, you will need to adapt it to get it working):

    api = tesseract.TessBaseAPI()
    api.Init(".","eng",tesseract.OEM_TESSERACT_ONLY, POINTER(ctypes.c_char_p("config")), 1, None, None, False)
    

    EDIT:
    Also note that disabling the DAWGs might not solve your issue. If I were you, I would simply iterate over the alternatives for each result and take the letter with the highest confidence (if the DAWG search is on, the default letters will not always be the ones with the highest confidence), and work more on improving the input image quality as described here.
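
On the pytesseract side of the first suggestion, one option that may avoid a separate config file is to pass the two DAWG variables through the `config` string, using the Tesseract command line's `-c VAR=VALUE` flag in the same way the question already does for the whitelist. The sketch below is untested against a real plate image. `build_plate_config` and `read_plate` are hypothetical helper names, not part of pytesseract, and the image path is the one from the question.

```python
import string

# Characters a plate can contain (same whitelist as in the question).
PLATE_CHARS = string.digits + string.ascii_uppercase


def build_plate_config(whitelist=PLATE_CHARS):
    """Build a config string for pytesseract that disables the dictionary DAWGs.

    Hypothetical helper: it only assembles Tesseract's "-c VAR=VALUE" options.
    """
    parts = [
        "-c load_system_dawg=0",  # intended to match "load_system_dawg F"
        "-c load_freq_dawg=0",    # intended to match "load_freq_dawg F"
        "-c tessedit_char_whitelist=" + whitelist,
    ]
    return " ".join(parts)


def read_plate(path):
    """Run OCR on the image at `path` with the dictionaries disabled."""
    # Imported here so the config helper can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(path), config=build_plate_config())
    return text.strip()
```

Calling `read_plate('images/text.jpg')` would then replace the `image_to_string` call from the question. If you prefer the config file described above, the `config` string should also accept the config file's name (for example `config="config"`), since pytesseract forwards that string to the tesseract command line. That assumes the file sits in tessdata/configs.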