I am working on a project where I should apply and OCR on some documents.
The first step is to threshold the image and let only the writing (whiten the background).
Example of an input image: (For the GDPR and privacy reasons, this image is from the Internet)
import cv2
import numpy as np
image = cv2.imread('b.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
h = image.shape[0]
w = image.shape[1]
for y in range(0, h):
for x in range(0, w):
if image[y, x] >= 120:
image[y, x] = 255
else:
image[y, x] = 0
cv2.imwrite('output.jpg', image)
Here is the result that I got:
When I applied pytesseract to the output image, the results were not satisfying (I know that an OCR is not perfect). Although I tried to adjust the threshold value (in this code it is equal to 120), the result was not as clear as I wanted.
Is there a way to make a better threshold in order to only keep the writing in black and whiten the rest?
You can use adaptive thresholding. From documentation :
In this, the algorithm calculate the threshold for a small regions of the image. So we get different thresholds for different regions of the same image and it gives us better results for images with varying illumination.
import numpy as np
import cv2
image = cv2.imread('b.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.medianBlur(image ,5)
th1 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_MEAN_C,\
cv2.THRESH_BINARY,11,2)
th2 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\
cv2.THRESH_BINARY,11,2)
cv2.imwrite('output1.jpg', th1 )
cv2.imwrite('output2.jpg', th2 )