Tags: python, opencv, ocr, tesseract

Clean text images with OpenCV for OCR reading


I received some images that need to be cleaned up before I can extract information from them with OCR. Here are the originals:

original 1

original 2

original 3

original 4

After processing them with this code:

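import cv2

# Load as grayscale, binarize at 55, then apply a 2x2 morphological opening
img = cv2.imread('original_1.jpg', cv2.IMREAD_GRAYSCALE)
ret, thresh = cv2.threshold(img, 55, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cv2.imwrite('result_1.jpg', opening)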

I get these results:

result 1

result 2

result 3

result 4

As you can see, some images give results that are good enough for OCR, while others still have noise in the background.

Any suggestions on how to clean up the background?


Solution

  • MH304's answer is very nice and straightforward. In case you can't use morphology or blurring to get a cleaner image, consider using an "Area Filter". That is, remove every blob whose area is below a minimum.

    Use OpenCV's connectedComponentsWithStats. Here's a C++ implementation of a very basic area filter:

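    //needs <opencv2/opencv.hpp>, <vector> and <cstdlib>
    //bwImage is assumed to be an 8-bit, single-channel binary image;
    //non-zero pixels are treated as foreground
    cv::Mat outputLabels, stats, centroids;
    int connectivity = 8;
    
    int numberofComponents = cv::connectedComponentsWithStats(bwImage, outputLabels, 
    stats, centroids, connectivity);
    
    //one random color per label (labels run from 0 to numberofComponents-1):
    std::vector<cv::Vec3b> colors(numberofComponents);
    for( int i = 0; i < numberofComponents; i++ ) {
        colors[i] = cv::Vec3b(rand()%256, rand()%256, rand()%256);
    }
    
    //do not count the original background-> label = 0:
    colors[0] = cv::Vec3b(0,0,0);
    
    //Area threshold:
    int minArea = 10; //10 px
    
    //start at 1 to skip the background label:
    for( int i = 1; i < numberofComponents; i++ ) {
    
        //get the area of the current blob:
        auto blobArea = stats.at<int>(i, cv::CC_STAT_AREA);
    
        //apply the area filter:
        if ( blobArea < minArea )
        {
            //filter blob below minimum area:
            //small regions are painted with (ridiculous) pink color
            colors[i] = cv::Vec3b(248,48,213);
    
        }
    
    }
    
    //paint every pixel with the color of its label:
    cv::Mat img_color = cv::Mat::zeros(bwImage.size(), CV_8UC3);
    for( int y = 0; y < img_color.rows; y++ ) {
        for( int x = 0; x < img_color.cols; x++ ) {
            img_color.at<cv::Vec3b>(y, x) = colors[outputLabels.at<int>(y, x)];
        }
    }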
    

    Using the area filter I get this result on your noisiest image:

    [image: result of the area filter on the noisiest image]

    Additional info:

    Basically, the algorithm goes like this:

    • Pass a binary image to connectedComponentsWithStats. The function returns the number of connected components and fills a matrix of labels and a matrix of statistics, which includes each blob's area.

    • Prepare a vector with one color per component; this helps visualize the blobs that are actually being filtered. The colors are generated randomly with the rand function: three values (B, G, R) per color, each in the range 0 to 255.

    • The background is label 0 and is colored black, so skip this "connected component" and leave its color alone.

    • Set an area threshold. All blobs with an area below it will be colored with a (ridiculous) pink.

    • Loop through all the connected components (blobs) found, retrieve the area of the current blob from the stats matrix, and compare it to the area threshold.

    • If the area is below the threshold, color the blob pink (in this case; usually you would want black).