Tags: image, opencv, image-processing, ocr, noise

Remove noise and staining in historical documents for OCR recognition


Hi, I am trying to remove as much noise as possible from historical documents.

These documents have staining that appears as small dots throughout the page, and it is affecting OCR and handwriting recognition. Apart from OpenCV's image denoising, is there a more effective way to clean such images?



Solution

  • One approach is to apply an adaptive threshold, dilate the result so the characters of each word merge into larger blobs, and then discard noise by filtering contours on bounding-box area and aspect ratio. Drawing the surviving contours onto a mask and bitwise-ANDing that mask with the input image gives a cleaned image.

    Since you didn't specify a language, I implemented it in Python:

    import cv2
    import numpy as np
    
    # Load image, create blank mask, convert to grayscale, Gaussian blur
    # then adaptive threshold to obtain a binary image
    image = cv2.imread('1.jpg')
    if image is None:
        raise FileNotFoundError("Could not read '1.jpg'")
    mask = np.zeros(image.shape, dtype=np.uint8)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (7,7), 0)
    thresh = cv2.adaptiveThreshold(blur,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV,51,9)
    
    # Create horizontal kernel then dilate to connect text contours
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,2))
    dilate = cv2.dilate(thresh, kernel, iterations=2)
    
    # Find contours and filter out noise using bounding-box area and aspect ratio.
    # These thresholds will likely need tuning for other scans and resolutions.
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # findContours returns 2 values in OpenCV 2.x/4.x and 3 values in 3.x
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        area = w * h
        ar = w / float(h)
        # Keep word-sized blobs; drop small specks, huge regions, and long thin streaks
        if area > 1200 and area < 50000 and ar < 6:
            cv2.drawContours(mask, [c], -1, (255, 255, 255), -1)
    
    # Bitwise-and input image and mask to get result
    mask = cv2.cvtColor(mask, cv2.COLOR_BGR2GRAY)
    result = cv2.bitwise_and(image, image, mask=mask)
    result[mask==0] = (255,255,255) # Color background white
    
    cv2.imshow('thresh', thresh)
    cv2.imshow('mask', mask)
    cv2.imshow('result', result)
    cv2.waitKey()
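
The filtering rule in the loop above can be pulled out into a small predicate, which makes the thresholds easier to experiment with on their own. The following is only a sketch: the function name `keep_box` and its parameter names are hypothetical, the default values are simply the ones used in the code above, and the sample boxes are made-up sizes chosen for illustration.

```python
def keep_box(w, h, min_area=1200, max_area=50000, max_aspect=6.0):
    """Return True if a w x h bounding box passes the area and aspect-ratio filter.

    The defaults mirror the thresholds in the answer's loop; they are not
    universal and would probably need adjusting for a different scan.
    """
    if w <= 0 or h <= 0:
        return False
    area = w * h                 # bounding-box area, not the contour's own area
    aspect = w / float(h)        # width-to-height ratio
    return min_area < area < max_area and aspect < max_aspect


# Made-up (width, height) pairs: a speck, a word-sized blob,
# a long thin streak, and an oversized region.
boxes = [(4, 4), (120, 30), (900, 12), (400, 200)]
kept = [b for b in boxes if keep_box(*b)]
print(kept)
```

In the loop, the `if` condition could then become `if keep_box(w, h):`, and different limits can be tried by passing other values for the keyword arguments.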