python, ocr, tesseract, easyocr

Extract individual characters from an image with OpenCV or OCR


I have an image containing a group of text.

I want to draw a bounding box around each individual character, but I have been unable to do so.

I tried EasyOCR with the following settings, but it only recognizes individual words:

import easyocr as eo

reader = eo.Reader(['en'], gpu=True)
result = reader.readtext(imgOriginal, y_ths=0.0000000001, x_ths=0.0000000001, paragraph=False)

I also tried setting psm/oem in tesserocr/pytesseract, but I still could not get individual characters. Please help.


Solution

  • Have a look at the GetComponentImages example from tesserocr and adapt it:

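    from PIL import Image, ImageOps
    from tesserocr import PyTessBaseAPI, PSM, RIL
    
    # Placeholder: point this at the directory holding your traineddata files.
    tessdata_path = 'path/to/tessdata'
    
    # ImageOps.grayscale already returns a mode 'L' image.
    image = ImageOps.grayscale(Image.open('test.png'))
    with PyTessBaseAPI(path=tessdata_path, psm=PSM.SPARSE_TEXT) as api:
        api.SetImage(image)
        api.Recognize()
        # RIL.SYMBOL asks for components at the individual-character level.
        boxes = api.GetComponentImages(RIL.SYMBOL, True)
        print('Found {} symbol image components.'.format(len(boxes)))
        # Each entry is (image, box dict, block id, paragraph id).
        for i, (im, box, _, _) in enumerate(boxes):
            print("Box[{0}]: x={x}, y={y}, w={w}, h={h}".format(i, **box))
            # display(im)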
    

    If the boxes are not accurate, try the legacy engine with oem=tesserocr.OEM.TESSERACT_ONLY and a traineddata file that includes the legacy model.
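
To actually draw the character boxes, the x/y/w/h dicts printed above can be converted into corner coordinates and passed to a drawing routine. The following is only a minimal sketch, assuming Pillow is used for drawing; `box_to_corners` and `draw_symbol_boxes` are hypothetical helper names, not part of tesserocr.

```python
from typing import Dict, List, Tuple

# tesserocr-style box as printed above: keys 'x', 'y', 'w', 'h' (pixels).
Box = Dict[str, int]


def box_to_corners(box: Box) -> Tuple[int, int, int, int]:
    """Convert an x/y/w/h box into (left, top, right, bottom) corners."""
    return (box['x'], box['y'], box['x'] + box['w'], box['y'] + box['h'])


def draw_symbol_boxes(image, boxes: List[Box], color: str = 'red'):
    """Return an RGB copy of a PIL image with one rectangle per box.

    Hypothetical helper: Pillow is imported lazily so that
    box_to_corners stays usable without it.
    """
    from PIL import ImageDraw

    canvas = image.convert('RGB')  # convert() returns a new image
    draw = ImageDraw.Draw(canvas)
    for box in boxes:
        draw.rectangle(box_to_corners(box), outline=color)
    return canvas
```

With the `boxes` list from the answer, something like `draw_symbol_boxes(image, [box for _, box, _, _ in boxes]).save('boxes.png')` should write an annotated copy. The same corner tuples could also be fed to `cv2.rectangle` if OpenCV is preferred.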