
Using a .traineddata file with PassportEye in Python for MRZ reading


I am trying to improve the accuracy of passport MRZ reading with Tesseract OCR and PassportEye. I have found a few GitHub repositories containing "*.traineddata" files. Their instructions say to move the file into the Tesseract tessdata folder, which I did. Nowhere in the READMEs of these repos does it say how to actually use the file. I believe it is something trivial, but I am very new to Tesseract.

How do I use it with PassportEye in Python? I am completely lost here and have searched a lot. Here is my current code:

import os
from passporteye import read_mrz

# Build the image path relative to the current working directory
pr_path = os.getcwd()
file_path = os.path.join(pr_path, 'my_app', 'data', 'test1.jpg')
mrz = read_mrz(file_path)

print(mrz)

This is the .traineddata file I want to test for better accuracy: https://github.com/DoubangoTelecom/tesseractMRZ/blob/master/tessdata_best/mrz.traineddata

I do not want to use bulky OpenCV. Please help.


Solution

  • From looking at the source code, I would say you can't, at least not without changing the PassportEye codebase:

    Normally you would pass the language to tesseract via the -l parameter, where the language name is the .traineddata file name without its extension. In your case:

    -l mrz

    But the PassportEye implementation does not give you that option:

    https://github.com/konstantint/PassportEye/blob/929c186c4dfa80a1ac975b5f2b95002ca12889d0/passporteye/util/ocr.py#L48

    They pass lang=None; you would need to change that part to lang='mrz':

    pytesseract.run_tesseract(input_file_name,
                              output_file_name_base,
                              'txt',
                              lang='mrz',
                              config=config)
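If editing the installed PassportEye package is inconvenient, the same change could in principle be applied at run time by wrapping the function it calls. The following is only a sketch under assumptions: `force_lang` is a hypothetical helper (not part of PassportEye or pytesseract), and it assumes PassportEye looks up `run_tesseract` on the pytesseract module at call time and passes `lang` as a keyword argument, as the linked line suggests. It has not been verified against every PassportEye version.

```python
import functools


def force_lang(run_tesseract, lang):
    """Wrap a run_tesseract-style callable so it always receives the given lang.

    Hypothetical helper for illustration; it overrides the `lang` keyword
    argument and forwards everything else unchanged.
    """
    @functools.wraps(run_tesseract)
    def wrapper(*args, **kwargs):
        # Replace whatever the caller passed (PassportEye passes lang=None).
        kwargs["lang"] = lang
        return run_tesseract(*args, **kwargs)
    return wrapper


# Possible usage (assumption, not executed here; requires pytesseract,
# PassportEye, and mrz.traineddata in the tessdata folder):
#
#   from pytesseract import pytesseract
#   from passporteye import read_mrz
#
#   pytesseract.run_tesseract = force_lang(pytesseract.run_tesseract, "mrz")
#   mrz = read_mrz("test1.jpg")
```

Patching a library at run time is fragile: if PassportEye imports or calls the function differently in another release, the wrapper will silently have no effect, so changing the source as shown above remains the more predictable route.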