Tags: python, image, image-processing, tesseract, python-tesseract

How can I extract information such as name, CPF, and RG from an image of a document in Python?


I'm sorry if the title of my question doesn't make my problem clear.

I'm trying to extract information from an image of a document using tesseract, but it doesn't work well on photos (on screenshots of text it works very well). Does anybody know a technique that could help? I think converting the image to black and white, with the information I want in black, would help a lot, but I don't know how to do that.

I'd be grateful if somebody could help me. (:


Solution

  • Preprocessing the image with OpenCV before passing it to tesseract might help.

    I usually follow these steps:

    1. Convert the image to grayscale (cv2.cvtColor)
    2. If the text in the image is small, enlarge the image using cv2.resize()
    3. Blur the image to reduce noise (cv2.GaussianBlur or cv2.medianBlur)
    4. Apply a threshold to make the text stand out from the background (cv2.threshold)
    5. Use the tesseract config to tell tesseract which characters to look for. For example, if the image contains only uppercase English letters and digits, passing config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' would help.