Tags: python, computer-vision, ocr, hebrew, python-tesseract

Extracting Hebrew text from an image in Python


I want to extract Hebrew text from an image.

I've tried using pytesseract, but it confuses some letters (for example, ' instead of י, or נ instead of כ).

I tried preprocessing the image (resizing, removing noise, and binarization), which helped a little, but I still got many mistakes.

I've spent hours searching for better text extraction tools but couldn't find any.

So here are my questions:

A) Is there a tool I can use that I might have missed?

B) If not, what are the steps to creating my own?

Thanks in advance, Amichai


Solution

  • Choosing the right OCR tool can be hard, but you seem to be on the right track already (as seen in this Stack Overflow post).

    Generally, if you are not satisfied with the quality of Tesseract, you are mostly out of luck. From what I have read, OCRopus might be an alternative, although it seems less straightforward to use than pytesseract.
    Also, digging a little deeper into the Tesseract GitHub repository revealed that an LSTM-based version 4.0 is under active development, which might give you better results. I am not sure which Tesseract version pytesseract calls, but it is worth investigating, since replacing Tesseract could be easier than learning an entirely new environment.

    PS: As for the question "how to build my own OCR", I would strongly advise against it. Just collecting the data and getting the basics right will cost you a lot of effort and is generally not worth your time; if you get something useful at all, it will likely still be worse than the existing libraries.
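
To follow up on the version question, here is a minimal sketch of how one might check which Tesseract binary pytesseract calls and request the LSTM engine for Hebrew. It assumes the `pytesseract` and `Pillow` packages, a Tesseract 4+ binary, and the Hebrew `heb` trained data are installed. The helper names `build_tesseract_config` and `ocr_hebrew`, the file name `scan.png`, and the default flag values are illustrative assumptions, not something from the original answer.

```python
def build_tesseract_config(oem: int = 1, psm: int = 6) -> str:
    """Build the extra command-line flags passed to the tesseract binary.

    --oem 1 selects the LSTM engine in Tesseract 4+, and --psm 6 treats
    the image as a single uniform block of text. Both defaults are
    assumptions; tune them for your own images.
    """
    return f"--oem {oem} --psm {psm}"


def ocr_hebrew(image_path: str, oem: int = 1, psm: int = 6) -> str:
    """Run Tesseract on an image using the Hebrew language data."""
    # Imported lazily so the config helper above works without these packages.
    import pytesseract
    from PIL import Image

    # pytesseract wraps whichever tesseract binary is installed; print its version.
    print("Tesseract version:", pytesseract.get_tesseract_version())

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image,
            lang="heb",  # requires heb.traineddata to be installed
            config=build_tesseract_config(oem, psm),
        )


# Example call (hypothetical file name):
# print(ocr_hebrew("scan.png"))
```

If the reported version is below 4, the LSTM engine is not available, and upgrading the Tesseract binary is probably a simpler next step than switching to a different OCR tool.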