python opencv tesseract

Image preprocessing and OCR for images of chemical structural formulas


I have been working on a project to identify chemical structures based on diagrams. I plan to use Tesseract or some other OCR library for Python for this. The problem is, Tesseract is barely able to process the diagrams and provide accurate results.

I've tried various forms of preprocessing via OpenCV, including thresholding, blurring and morphological kernels (dilation, erosion, closing). However, there is no improvement: Tesseract picks up 1-2 characters at best and often returns nothing at all.

Does anyone have suggestions for what preprocessing to perform on images that are roughly 600x800, consist mostly of scattered groups of one or two characters, and look like the attachment provided?

OCR readers on websites like Brandfolder (https://brandfolder.com/workbench/extract-text-from-image?hl=en_GB) appear to provide satisfactory results, so I think this can be done.

My code for reference:

import cv2
import matplotlib.pyplot as plt
import pytesseract
from PIL import Image
import numpy as np

img_path = "D:\\User\\Intel_AI2\\Project\\Data\\Test\\nahco3.png"
img_cv = cv2.imread(img_path)
gray_img = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
inverted = cv2.bitwise_not(gray_img)
thresh, img_threshed = cv2.threshold(inverted, 50, 255, cv2.THRESH_BINARY)
kernel1 = np.ones((2, 2), np.uint8)
kernelised1 = cv2.dilate(img_threshed, kernel1, iterations=1)
kernel2 = np.ones((1, 1), np.uint8)
kernelised2 = cv2.erode(kernelised1, kernel2, iterations=1)
morphed = cv2.morphologyEx(kernelised2, cv2.MORPH_CLOSE, kernel2)
blurred = cv2.medianBlur(morphed, 3)

print(pytesseract.image_to_string(blurred, lang='eng', config='--psm 10 --oem 3'))

cv2.imshow("Normal", blurred)
cv2.waitKey(0)
cv2.destroyAllWindows()

Solution

  • The problem is that your PNG has no background (it is transparent). If you can use ImageMagick, you can do the following:

    import subprocess
    import pytesseract
    
    # Image manipulation: path to ImageMagick's convert executable
    con_bw = r"D:\Programme\ImageMagic\convert.exe"
    
    in_file = r'C2H4.png'
    out_file = r'C2H4_bw.png'
    
    # Replace the transparent background with white, then binarise.
    # Play with the threshold and the resize factor for better results.
    subprocess.run([con_bw, in_file, "-resize", "18%", "-background", "white",
                    "-alpha", "remove", "-alpha", "off", "-threshold", "1%", out_file],
                   check=True)
    
    # Text processing
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    
    # Configuration, use whitelist.
    # Note: Tesseract normally takes variables as "-c name=value". Without "-c"
    # the whitelist is probably not applied, which would explain why the output
    # below still contains "\" and "/".
    options = r'--psm 6 --oem 3 tessedit_char_whitelist=HCIhci='
    
    # OCR the converted image using Tesseract
    text_bw = pytesseract.image_to_string(out_file, config=options)
    print(text_bw)
    

    Output:

    H H
    \ /
    C=C
    / \
    
    H H
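
If installing ImageMagick is not an option, the same background fix can probably be done in Python directly. The following is only a minimal sketch, not tested against the image from the question. It assumes an 8-bit PNG with an alpha channel, and the helper names `flatten_alpha_to_white` and `ocr_transparent_png` are made up for this example. The idea is to load the file with `cv2.IMREAD_UNCHANGED` so the alpha channel is kept, composite the drawing onto white, and only then threshold and run OCR. With the default `cv2.imread(path)` the alpha channel is dropped, so dark text on a transparent background can end up as dark text on a dark background.

```python
import numpy as np


def flatten_alpha_to_white(bgra: np.ndarray) -> np.ndarray:
    """Composite an 8-bit BGRA image onto white and return 8-bit grayscale."""
    if bgra.ndim != 3 or bgra.shape[2] != 4:
        raise ValueError("expected an H x W x 4 (BGRA) array")
    alpha = bgra[:, :, 3:4].astype(np.float32) / 255.0   # 0 = transparent, 1 = opaque
    colour = bgra[:, :, :3].astype(np.float32)
    blended = colour * alpha + 255.0 * (1.0 - alpha)      # white shows through where transparent
    gray = blended.mean(axis=2)                           # plain channel average, good enough for line art
    return np.round(gray).astype(np.uint8)


def ocr_transparent_png(path: str) -> str:
    """Hypothetical helper: flatten the background, binarise with Otsu, run Tesseract."""
    import cv2                # imported here so the helper above only needs NumPy
    import pytesseract

    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)          # keeps the alpha channel if present
    if img is None:
        raise FileNotFoundError(path)
    if img.ndim == 3 and img.shape[2] == 4:
        gray = flatten_alpha_to_white(img)
    elif img.ndim == 3:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    else:
        gray = img                                        # already single-channel
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config="--psm 6 --oem 3")
```

Usage would be something like `print(ocr_transparent_png("C2H4.png"))`. As with the ImageMagick version, resizing the image and adjusting the page segmentation mode will likely still need some experimentation.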