Tags: python, python-tesseract

Detecting Bangla characters using pytesseract


I am trying to detect Bangla characters in images of Bangla number plates using Python, so I decided to use pytesseract. I used the code below:

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('input.png'), lang="ben")
print(text)

The problem is that when I print the result, the output is empty.

When I tried to save the result to a text file, it looked like this:

[screenshot]

Example Picture: (Link)

Expected output (this, or something reasonably close to it):

ঢাকা মেট্রো হ

৪৫ ২৩০৭

P.S. I downloaded the Bengali language data when installing Tesseract-OCR-64, and I am running the script in VS Code.

Can anyone help me solve this problem, or suggest an approach to solving it?


Solution

  • The solution to this problem is:

    Segment the image into individual characters (use any approach you like, whether deep learning or classical image processing) and feed pytesseract one character at a time. Use only clear photos.

    Reason: Tesseract can detect Bangla text in pictures that are clear and of reasonably high resolution. Its Bangla model might have been trained on considerably fewer small, low-resolution samples, which is quite understandable, so small text such as a whole number plate tends to come back empty.

    Code:

    # Segment the characters here, using any deep learning
    # or image processing approach, and save each one as an image.

    import pytesseract
    from PIL import Image

    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

    # Load one segmented character and recognize it with the Bengali model
    character = pytesseract.image_to_string(Image.open('char.png'), lang="ben")
    print(character)
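
The segmentation step is left open above, so here is a minimal sketch of one classical image-processing option: splitting the image wherever a column contains no ink. This is not necessarily what the answer had in mind. The helper names (`find_runs`, `segment_columns`, `ocr_segments`), the threshold of 128, the 3x upscaling, and the `--psm 10` option (Tesseract's "treat the image as a single character" page segmentation mode) are all assumptions made for illustration. It also assumes dark text on a light background and a single line of text. Bangla letters joined by a headline (matra) will not be separated by blank columns, so they would come out as one segment.

```python
from typing import List, Tuple


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for each run of non-zero values; end is exclusive."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ended at a blank column
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the right edge
    return runs


def segment_columns(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """binary holds rows of 0/1 pixels (1 = ink). Returns column ranges that contain ink."""
    if not binary:
        return []
    profile = [sum(column) for column in zip(*binary)]  # ink count per column
    return find_runs(profile)


def ocr_segments(image_path: str, threshold: int = 128, scale: int = 3) -> List[str]:
    """Hypothetical helper: OCR each ink-bearing column range separately."""
    # Imported here so the segmentation helpers work without these packages.
    import pytesseract
    from PIL import Image

    gray = Image.open(image_path).convert("L")
    width, height = gray.size
    data = gray.tobytes()  # one byte per pixel, row by row, in mode "L"
    # Assumption: pixels darker than the threshold are ink.
    binary = [
        [1 if data[y * width + x] < threshold else 0 for x in range(width)]
        for y in range(height)
    ]

    results = []
    for left, right in segment_columns(binary):
        crop = gray.crop((left, 0, right, height))
        # Upscale, since small glyphs tend to be recognized poorly.
        crop = crop.resize((crop.width * scale, crop.height * scale))
        text = pytesseract.image_to_string(crop, lang="ben", config="--psm 10")
        results.append(text.strip())
    return results
```

Calling `ocr_segments('input.png')` would return one string per segment. For a two-line plate like the example, crop each text line first (the same `find_runs` idea applied to row sums works) and run the function on each line separately.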