
How to fill outlined text with PIL before using tesseract?


[image: ocr image]

Hi, I want to OCR this image using PIL and tesseract. Generally it works fine, except for outlined numbers such as 1148 in this image, which tesseract cannot recognize. So I want to use PIL to fill the outlined text 1148 so that it becomes solid text, but I do not know how to do it. Any help would be appreciated.

And this is my code:

import tesseract
from PIL import ImageGrab, ImageFilter

api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
# Only digits and the decimal point are expected
api.SetVariable("tessedit_char_whitelist", "0123456789.")
api.SetPageSegMode(tesseract.PSM_AUTO)
# Grab a screen region; the box is (left, top, right, bottom)
pic = ImageGrab.grab((120, 90, 180, 650))
pic = pic.filter(ImageFilter.CONTOUR)
pic.save("321.png")
mImgFile = "321.png"
mBuffer = open(mImgFile, "rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer, len(mBuffer), api)
print(result)

Solution

  • You can try the experimental floodfill() function in ImageDraw.

    If you can find a point inside each digit, use it like this:

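    from PIL import ImageColor, ImageDraw

    # Any pixel known to lie inside the hollow part of a digit
    point_inside_digit = (some_x, some_y)

    # floodfill() is a module-level function that modifies the image in place,
    # so no ImageDraw.Draw object is needed
    ImageDraw.floodfill(pic, point_inside_digit, ImageColor.getrgb("black"))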
    

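    Besides white, the digits also contain some blue and yellow, so it may be better to fill up to the black border. With border set, the fill covers every pixel it reaches, whatever its color, and stops only at pixels of the border color: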

    ImageDraw.floodfill(
        pic, point_inside_digit, ImageColor.getrgb("black"),
        border=ImageColor.getrgb("black"))
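To try the border fill without the screenshot, here is a minimal sketch on a synthetic image. The helper name `fill_outline`, the 40x40 test image, and the seed coordinates are made up for illustration; only `ImageDraw.floodfill()` and its `border` argument come from the answer above. A hollow black rectangle stands in for an outlined digit, with one yellow pixel inside it to mimic the stray colors.

```python
from PIL import Image, ImageDraw


def fill_outline(img, seed, fill=(0, 0, 0), border=(0, 0, 0)):
    """Return an RGB copy of img, flood-filled from seed up to the border color.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    out = img.convert("RGB")  # work on a copy so the input stays untouched
    ImageDraw.floodfill(out, seed, fill, border=border)
    return out


# Synthetic stand-in for an outlined digit: a hollow black rectangle on white
demo = Image.new("RGB", (40, 40), "white")
ImageDraw.Draw(demo).rectangle((10, 10, 30, 30), outline="black")
demo.putpixel((20, 20), (255, 255, 0))  # stray yellow pixel inside the outline

# Seed at a point known to be inside the outline
filled = fill_outline(demo, (15, 15))
```

Because `border` is given, the yellow pixel is painted over along with the white interior, while everything outside the outline stays white. On a real screenshot each digit, and each enclosed region within a digit, would likely need its own seed point.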