python opencv tesseract python-tesseract

Tesseract OCR not recognizing any character


I am working on a project that requires character recognition. I am using the IAM handwriting dataset, so all the images were captured under more or less the same conditions. I am using the word images provided by the dataset and following these steps:

  • Binarizing and thresholding
  • Dividing the word into the characters constituting it
  • Resizing the extracted character
  • Letting Tesseract identify which letter of the English alphabet it is

What I'm trying to achieve is to store the characters from a person's document in folders categorized by letter, and maybe form a template from them later on. For this I need to know which character each one is.
Here's what I get as a result:
[image: segmentation result]

All the characters are properly segmented (in most cases). This is more of a Tesseract question than a Python question, but I'm using Python to write the script and calling Tesseract through the pytesseract wrapper.
I'm using OpenCV to manipulate the image. Images of these letter matrices are sent as input to Tesseract (via pytesseract). The input is not an issue, I assure you. Is there anything else I need to do for Tesseract to work?

None of these characters are recognized.


Solution

  • Tesseract doesn't support handwritten text well; its standard models are trained mainly on printed text. For handwriting, try ABBYY OCR or a free alternative such as Lipi Toolkit.
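
If you still want to test Tesseract on the segmented glyphs, one setting worth checking is the page segmentation mode. By default Tesseract expects a page of text, while `--psm 10` tells it to treat the image as a single character. The sketch below assumes pytesseract and the Tesseract binary are installed; `single_char_config` and `recognize_char` are hypothetical helper names, and the `tessedit_char_whitelist` option is not honored by every Tesseract version and engine. Even with these settings, results on handwriting are likely to stay poor.

```python
import string


def single_char_config(whitelist=string.ascii_letters):
    """Build a Tesseract config string for images holding one glyph."""
    # --psm 10: treat the image as a single character.
    # tessedit_char_whitelist: restrict output to the listed characters
    # (assumption: the installed Tesseract version supports it).
    return f"--psm 10 -c tessedit_char_whitelist={whitelist}"


def recognize_char(image, whitelist=string.ascii_letters):
    """Run Tesseract on one glyph image and return the stripped text."""
    import pytesseract  # imported lazily so the config helper works without it

    text = pytesseract.image_to_string(image, config=single_char_config(whitelist))
    return text.strip()
```

`recognize_char` accepts a PIL image or a NumPy array, so a glyph produced with OpenCV can be passed in as-is. An empty string means Tesseract returned nothing for that glyph.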