I am working on a project to identify chemical structures from diagrams. I plan to use Tesseract, or another OCR library for Python, for this. The problem is that Tesseract can barely process the diagrams and does not give accurate results.
I've tried various forms of preprocessing with OpenCV, including thresholding, blurring, and morphological operations (dilation, erosion, closing). None of it helps: Tesseract picks up one or two characters and often returns nothing at all.
Does anyone have suggestions for how to preprocess images that are roughly 600x800, consist mostly of scattered spots of one or two characters, and look like the attachment provided?
OCR readers on websites like Brandfolder (https://brandfolder.com/workbench/extract-text-from-image?hl=en_GB) appear to provide satisfactory results, so I think this can be done.
My code for reference:
import cv2
import matplotlib.pyplot as plt
import pytesseract
from PIL import Image
import numpy as np
img_path = "D:\\User\\Intel_AI2\\Project\\Data\\Test\\nahco3.png"
img_cv = cv2.imread(img_path)
gray_img = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
inverted = cv2.bitwise_not(gray_img)
thresh, img_threshed = cv2.threshold(inverted, 50, 255, cv2.THRESH_BINARY)
kernel1 = np.ones((2, 2), np.uint8)
kernelised1 = cv2.dilate(img_threshed, kernel1, iterations=1)
kernel2 = np.ones((1, 1), np.uint8)
kernelised2 = cv2.erode(kernelised1, kernel2, iterations=1)
morphed = cv2.morphologyEx(kernelised2, cv2.MORPH_CLOSE, kernel2)
blurred = cv2.medianBlur(morphed, 3)
print(pytesseract.image_to_string(blurred, lang='eng', config='--psm 10 --oem 3'))
cv2.imshow("Normal", blurred)
cv2.waitKey(0)
cv2.destroyAllWindows()
The problem is that your PNG has no background: it is transparent. If you can use ImageMagick, you can do the following:
import subprocess
import pytesseract
# Image manipulation
mag_img = r'D:\Programme\ImageMagic\magick.exe'
con_bw = r"D:\Programme\ImageMagic\convert.exe"
in_file = r'C2H4.png'
out_file = r'C2H4_bw.png'
# Flatten the transparent background onto white, then binarize.
# Play with the resize percentage and the threshold for better results.
process = subprocess.run([
    con_bw, in_file,
    "-resize", "18%",
    "-background", "white", "-alpha", "remove", "-alpha", "off",
    "-threshold", "1%",
    out_file,
])
# Text processing
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
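# Configuration, use whitelist
# Note: Tesseract normally takes config variables via -c
# (e.g. -c tessedit_char_whitelist=...). Written without -c, as here, the
# whitelist is probably not applied, which would explain why \ and / still
# appear in the output below.
options = r'--psm 6 --oem 3 tessedit_char_whitelist=HCIhci='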
# OCR the input image using Tesseract
text_bw = pytesseract.image_to_string(out_file, config=options)
print(text_bw)
Output:
H H
\ /
C=C
/ \
H H
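If you would rather stay in Python than call ImageMagick, the same background fix can be sketched with NumPy. This is only a minimal sketch: `flatten_alpha_on_white` is a hypothetical helper name, and it assumes the image was loaded with its alpha channel intact, for example via `cv2.imread(path, cv2.IMREAD_UNCHANGED)`. A plain `cv2.imread(path)` discards the alpha channel, which is likely why the preprocessing in the question never saw a usable background.

```python
import numpy as np


def flatten_alpha_on_white(bgra):
    """Composite a 4-channel uint8 image onto a white background.

    Hypothetical helper (not part of OpenCV or pytesseract). Expects an
    array of shape (H, W, 4) with alpha as the last channel, as returned
    by cv2.imread(path, cv2.IMREAD_UNCHANGED) for a transparent PNG.
    Returns a 3-channel uint8 image.
    """
    if bgra.ndim != 3 or bgra.shape[2] != 4:
        # No alpha channel present, so there is nothing to flatten.
        return bgra
    color = bgra[..., :3].astype(np.float32)
    alpha = bgra[..., 3:4].astype(np.float32) / 255.0
    # Standard alpha blend: foreground * alpha + white * (1 - alpha).
    blended = color * alpha + 255.0 * (1.0 - alpha)
    return np.rint(blended).astype(np.uint8)
```

The result can then go through the usual grayscale conversion and thresholding before being handed to `pytesseract.image_to_string`. The resize factor and threshold value still need tuning per image, as in the ImageMagick version.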