python ocr tesseract python-tesseract

How to extract data from an image accurately using PyTesseract?


I am trying to extract text from an image accurately using Python.

This is the image I am using in this scenario:

Image 1

This is my Python file:

from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Users\test\AppData\Roaming\Python\Python37\site-packages\tesseract.exe'

img = Image.open('C:/Users/test/Desktop/Everything else/work/Almonds.jpg')

text = pytesseract.image_to_string(img, lang='eng')

print(text)

And this is the output when I run the Python file from the Command Prompt:

INGREDIENTS: Almonds: [Nuts] Allergy Advice:
For allergens, see ingredients in Bold

Nutritional Information
TYPICALVALUES Per 100g

Energy kJ 2597.0}
Energy kcal 626.0)
Fat 50.6g|

of which Saturates 3.9g

Carbohy drate 19.7g

of which Sugars 4.89|
Fibre 3.59
Protein 21.3g|

May contain traces of
other nuts, peanut,
sesame or gluten

This product may contain
pieces of shell

Store in a cool dry place
jout of direct sunlight

Net weight:



Salt 0.ig

For Best Before & Batch see pack 1 k

As you can see, not all of the text is recognized correctly. Are there any recommendations for improving the accuracy of the text output?

EXTRA

Here is some context on what I am trying to achieve. It is not directly relevant to the question.

I have multiple image files of products that I want to compare against an Excel sheet.

The Excel sheet is formatted in the following way (one example record):

Product Code: 0001
Product Desc: Californian Whole Almonds
Ingredients: Almonds: [Nuts]
Allergy Advice: True
etc...

I will then write a script that detects the text in each image file, compares it with the corresponding record in the Excel sheet, and checks whether each section matches, outputting 'True' or 'False'.
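The comparison step described above could be sketched roughly as follows. This is only an illustration using the standard library: the `record` dictionary stands in for one Excel row (it is not read from a real file), and the helper names `normalize` and `field_matches` as well as the 0.85 similarity threshold are assumptions, not part of the original script.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def field_matches(expected, ocr_text, threshold=0.85):
    """Return True if `expected` appears in `ocr_text`, tolerating small OCR errors."""
    needle = normalize(expected)
    haystack = normalize(ocr_text)
    if not needle:
        return False
    if needle in haystack:
        return True
    # Slide a window of the needle's length over the OCR text
    # and keep the best fuzzy similarity score.
    window = len(needle)
    best = max(
        (SequenceMatcher(None, needle, haystack[i:i + window]).ratio()
         for i in range(max(1, len(haystack) - window + 1))),
        default=0.0,
    )
    return best >= threshold


# Hypothetical record shaped like the example row above (assumption: one dict per product).
record = {
    "Ingredients": "Almonds: [Nuts]",
    "Product Desc": "Californian Whole Almonds",
}

# First two lines of the OCR output shown in the question.
ocr_text = (
    "INGREDIENTS: Almonds: [Nuts] Allergy Advice:\n"
    "For allergens, see ingredients in Bold"
)

results = {field: field_matches(value, ocr_text) for field, value in record.items()}
print(results)
```

Normalizing both sides before comparing keeps punctuation and case differences from causing false mismatches, and the fuzzy fallback lets a split word such as "Carbohy drate" still count as a match. The threshold would need tuning against real labels.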


Solution

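  • Preprocessing the image to smooth it and remove noise before passing it to Pytesseract can help. For this image, removing the horizontal and vertical table lines will likely improve detection, since the stray `|`, `}` and `)` characters in the output appear where those lines are.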

    import cv2
    
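    # Load as grayscale, then apply an inverted Otsu threshold so text and lines are white on black
    image = cv2.imread('1.jpg', cv2.IMREAD_GRAYSCALE)
    thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]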
    
    # Remove horizontal lines
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25,1))
    detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
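    cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # findContours returns 2 values in OpenCV 2.x/4.x and 3 in 3.x; pick the contour list either way
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    # Paint the detected lines black so they disappear from the thresholded image
    cv2.fillPoly(thresh, cnts, [0,0,0])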
    
    # Remove vertical lines
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,45))
    detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
    cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    cv2.fillPoly(thresh, cnts, [0,0,0])
    
    # Invert back to dark text on a light background before OCR
    result = 255 - thresh
    
    cv2.imshow('thresh', thresh)
    cv2.imshow('result', result)
    cv2.waitKey()
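The snippet above stops at displaying the cleaned image. A minimal sketch of passing it on to Pytesseract might look like the following. The helper names `tesseract_config` and `ocr_cleaned_image` are hypothetical, and the choice of `--psm 6` (treat the image as a single uniform block of text) is an assumption that may or may not suit this label; other page segmentation modes are worth trying.

```python
def tesseract_config(psm=6, oem=3):
    """Build the option string that pytesseract forwards to the Tesseract CLI."""
    return f"--oem {oem} --psm {psm}"


def ocr_cleaned_image(result, lang="eng", psm=6):
    """Run OCR on an already preprocessed image (e.g. the `result` array above)."""
    # Imported here so the helper above can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(result, lang=lang, config=tesseract_config(psm))
```

After computing `result` in the preprocessing code, this would be called as `print(ocr_cleaned_image(result))`. As in the question, `pytesseract.pytesseract.tesseract_cmd` may need to point at the Tesseract executable first.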