Clean text images with OpenCV for OCR reading

I received some images that need to be cleaned up so that I can OCR some information out of them. Here are the originals:

original 1

original 2

original 3

original 4

After processing them with this code:

import cv2

img = cv2.imread('original_1.jpg', cv2.IMREAD_GRAYSCALE)
ret, thresh = cv2.threshold(img, 55, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cv2.imwrite('result_1.jpg', opening)

I get these results:

result 1

result 2

result 3

result 4

As you can see, some images come out clean enough for OCR, while others still have noise in the background.

Any suggestions on how to clean up the background?

Solution

MH304's answer is very nice and straightforward. In case you can't use morphology or blurring to get a cleaner image, consider using an "Area Filter". That is, filter out every blob whose area is below a minimum.

Use OpenCV's connectedComponentsWithStats for this. Here's a C++ implementation of a very basic area filter:

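//needs <opencv2/opencv.hpp>, <vector> and <cstdlib>
//bwImage (not defined here) is the 8-bit, single-channel binary input:
//white blobs on a black background
int connectivity = 8; //assumed value; 4 is the other valid option

cv::Mat outputLabels, stats, img_color, centroids;

int numberofComponents = cv::connectedComponentsWithStats(bwImage, outputLabels, 
stats, centroids, connectivity);

//one random BGR color per label, for visualization:
std::vector<cv::Vec3b> colors(numberofComponents);
for( int i = 1; i < numberofComponents; i++ ) {
    colors[i] = cv::Vec3b(rand()%256, rand()%256, rand()%256);
}

//do not count the original background-> label = 0:
colors[0] = cv::Vec3b(0,0,0);

//Area threshold:
int minArea = 10; //10 px

//labels run from 0 (background) to numberofComponents-1:
for( int i = 1; i < numberofComponents; i++ ) {

    //get the area of the current blob:
    auto blobArea = stats.at<int>(i, cv::CC_STAT_AREA);

    //apply the area filter:
    if ( blobArea < minArea )
    {
        //filter blob below minimum area:
        //small regions are painted with (ridiculous) pink color
        colors[i] = cv::Vec3b(248,48,213);

    }

}

//paint every pixel with the color assigned to its label:
img_color = cv::Mat::zeros(bwImage.size(), CV_8UC3);
for( int y = 0; y < img_color.rows; y++ ) {
    for( int x = 0; x < img_color.cols; x++ ) {
        int label = outputLabels.at<int>(y, x);
        img_color.at<cv::Vec3b>(y, x) = colors[label];
    }
}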

Using the area filter I get this result on your noisiest image:

Additional info:

Basically, the algorithm goes like this:

Pass a binary image to connectedComponentsWithStats. The function returns the number of connected components and fills in a matrix of labels plus an additional matrix of statistics, including each blob's area.
Prepare a color vector with one entry per connected component; this helps visualize the blobs that are actually being filtered. The colors are generated randomly with the rand function: three values per color (B, G, R), each in the range 0 to 255.
The background is label 0 and is colored black, so skip this “connected component” and leave its color as black.
Set an area threshold. Every blob whose area is below this threshold will be colored with a (ridiculous) pink.
Loop through all the connected components (blobs) that were found, retrieve the area of the current blob from the stats matrix, and compare it to the area threshold.
If the area is below the threshold, color the blob pink (in this case; usually you would paint it black to remove it).