python opencv computer-vision feature-descriptor feature-matching

Find a fragment in the whole image

Overall, my task is to determine how similar two .jpg files are. In more detail: I have five template .jpg files (in reality there are more) and a new .jpg file, which I must compare against each template to decide whether the new file is similar to any of the templates or not.

Correlating the entire files is a bad idea in my case, since the error is large. So I decided to cut the new file into 12 equal fragments (that is, into 12 .jpg files) and search for each individual fragment in the template.

For this, I used the tutorial https://docs.opencv.org/4.x/dc/dc3/tutorial_py_matcher.html

The problem is that the fragments from the new .jpg file are matched against the template extremely poorly.

Below I will show an example: Let's take the document below as a template

And the document below as the new document (I receive one document as input and cut it into 12 fragments, that is, 12 new files; the cuts are shown schematically):

Next, here is my code. The gist of it is that I take each of the 12 fragments and search for that fragment in the template:

def match_slices_in_template(path_template):
    directory_in_str = 'slices'
    directory = os.fsencode(directory_in_str)

    img1 = cv.imread(path_template, cv.IMREAD_GRAYSCALE)  # queryImage

    good = []
    for slice_image in os.listdir(directory):
        print(slice_image)
        filename = os.fsdecode(slice_image)
        img2 = cv.imread(f'slices/{filename}', cv.IMREAD_GRAYSCALE)  # trainImage

        # Initiate SIFT detector
        sift = cv.SIFT_create()

        # find the keypoints and descriptors with SIFT
        kp1, des1 = sift.detectAndCompute(img1, None)
        kp2, des2 = sift.detectAndCompute(img2, None)

        # BFMatcher with default params
        bf = cv.BFMatcher()
        matches = bf.knnMatch(des1, des2, k=2)

        # Apply ratio test
        for m, n in matches:
            if m.distance < 0.3 * n.distance:
                good.append([m])

        # cv.drawMatchesKnn expects list of lists as matches.
        img3 = cv.drawMatchesKnn(img1, kp1, img2, kp2, good, None, flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)

        plt.imshow(img3), plt.show()

print(match_slices_in_template('Bill_1.jpg'))

But the search result is completely incorrect. Here are some of the sample plots that matplotlib produced:

In my example, in fact, the two files are different (although they have a lot in common). But the program identifies them as quite similar.

So the essence of my question is: how can I improve the algorithm so that it determines similarity or difference more accurately than it does now?

Solution

I wrote up an example to point you toward classifying the documents by their text rather than by the structural similarity between them.

My folder is organized like this:

I have two training images and three test images. The training images are in the Templates folder and the images I want to classify are in the toBeClassified folder. Here are my input images:

Now that we got that out of the way, here's a function that can get text from the images:

def text_from_image(filename):
    '''
    Load the image as grayscale with cv2.imread(filename, 0) and
    extract its text with pytesseract; the text is used for classification
    '''
    imGray = cv2.imread(filename, 0)
    text = pytesseract.image_to_string(imGray)
    return text

Now that we have the text, we can use it together with some labels that serve as training responses to train the model:

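# NOTE: os.listdir returns files in arbitrary order, so the labels below must be
# listed in the same order as the files actually come back
texts = [text_from_image("Templates/"+filename) for filename in os.listdir("Templates/")] # get the texts
labels = ["Shipping", "Invoice"] # label the texts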
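
Because `os.listdir` returns files in arbitrary order, pairing the texts with a hand-written list of labels can silently mislabel a template. Here is a minimal sketch of one way around that. It assumes a hypothetical `label_by_file` mapping from filename to label, and it takes the text extractor as a parameter so it can be tried without OCR:

```python
import os

def load_labelled_texts(folder, label_by_file, extract_text):
    """Return (texts, labels) with each label looked up by filename.

    label_by_file is a hypothetical dict such as {"invoice.png": "Invoice"};
    extract_text is any callable that maps a file path to a string.
    """
    texts, labels = [], []
    for filename in sorted(os.listdir(folder)):  # sorted only for reproducibility
        if filename not in label_by_file:
            continue  # skip files that have no label assigned
        texts.append(extract_text(os.path.join(folder, filename)))
        labels.append(label_by_file[filename])
    return texts, labels
```

It could then be called as `load_labelled_texts("Templates/", label_by_file, text_from_image)`, where the keys of `label_by_file` are whatever the template files are actually named.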

These lines are used to get the model and to train it on the templates:

model = make_pipeline(TfidfVectorizer(), MultinomialNB()) # init the model
model.fit(texts, labels) # fit the model
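
To see what the pipeline does without running OCR, here is a small sketch on made-up strings (the two training sentences and the query are invented for illustration; they are not OCR output). `TfidfVectorizer` turns each text into weighted word counts, and `MultinomialNB` learns which words are typical for each label:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-ins for the OCR text of two templates
toy_texts = [
    "shipping label carrier tracking number delivery address",
    "invoice total amount due payment tax",
]
toy_labels = ["Shipping", "Invoice"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# Words that never occurred in training are ignored by the vectorizer,
# so only "tracking", "number" and "delivery" influence the prediction
prediction = toy_model.predict(["tracking number for the delivery"])[0]
print(prediction)
```

With only one example per class, the model has very little to learn from, so more templates per class should make the probabilities more meaningful.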

Now there are two approaches: either we settle for the class with the highest probability, or we predict the probabilities for every class.

The first approach:

def classify_with_model(filename, model):
    '''
    We pass the filename of the images to be classified and the model to get the 
    class with the highest probability, not very good but good for now
    '''
    text = text_from_image(filename)
    category = model.predict([text])[0]
    return category

If we loop through the test images, we get:

for filename in os.listdir("toBeClassified/"): # loop through the test images
    category = classify_with_model("toBeClassified/"+filename, model) # get the class with the highest prob
    print(f'Document {filename} is classified as {category}') # print out results

# Document Bill1.png is classified as Invoice
# Document Shipping1.jpg is classified as Shipping
# Document ShippingInvoice1.png is classified as Invoice

Notice that the last image is a kind of hybrid that I found online, so looking at the probabilities is quite essential in some cases. As the joke about SVM learners goes: A healthy person goes to the doctor for a cancer screening. The doctor uses a state-of-the-art SVM machine learning model with 100% accuracy in identifying different types of cancer. After the test, the doctor comes back and says, "Good news and bad news. The good news: our model is perfect at diagnosing every type of cancer. The bad news: it doesn't know how to say you're healthy."

In any case, no time for humour, or my attempt at it. We go with probabilities:

def classify_with_probability(filename, model):
    '''
    Classify with probabilities, which makes the confidence for each class visible
    '''
    text = text_from_image(filename)
    probabilities = model.predict_proba([text])[0]  # get probabilities for each class
    categories = model.classes_  # get the class labels
    bestMatchIdx = probabilities.argmax()  # index of the highest probability
    bestMatch = categories[bestMatchIdx]  # class with the highest probability
    return bestMatch, dict(zip(categories, probabilities))  # best class and all class probabilities

And the results are:

for filename in os.listdir("toBeClassified/"):
    category = classify_with_probability("toBeClassified/"+filename, model)
    print(f'Document {filename} is classified as {category[0]} with the following confidence:')
    print(category[1])
    print("_______________________")
#Document Bill1.png is classified as Invoice with the following confidence:
#{'Invoice': 0.5693790615221315, 'Shipping': 0.4306209384778673}
#_______________________
#Document Shipping1.jpg is classified as Shipping with the following confidence:
#{'Invoice': 0.38458825025403914, 'Shipping': 0.6154117497459587}
#_______________________
#Document ShippingInvoice1.png is classified as Invoice with the following confidence:
#{'Invoice': 0.5519774748181495, 'Shipping': 0.4480225251818504}
#_______________________

Here we see the probabilities as well. This can be helpful if you want to assign your images to multiple classes at the same time, e.g., invoice and shipping form.
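
Since the original question is whether a new document matches any template at all, the probabilities can also be used to reject uncertain predictions. Here is a minimal sketch, assuming a hypothetical `min_confidence` cutoff of 0.6 (a value picked only for illustration; it would have to be tuned on real documents):

```python
def accept_or_reject(probabilities_by_class, min_confidence=0.6):
    """Return the most probable class, or None when it falls below the cutoff.

    probabilities_by_class is a dict like the second value returned by
    classify_with_probability; min_confidence is an assumed threshold.
    """
    best_class = max(probabilities_by_class, key=probabilities_by_class.get)
    if probabilities_by_class[best_class] < min_confidence:
        return None  # no class is confident enough: treat as "no match"
    return best_class

# Probabilities copied from the output above
print(accept_or_reject({'Invoice': 0.38458825025403914, 'Shipping': 0.6154117497459587}))  # Shipping
print(accept_or_reject({'Invoice': 0.5519774748181495, 'Shipping': 0.4480225251818504}))   # None
```

Note that Naive Bayes probabilities always sum to 1 across the known classes, so a cutoff like this flags ambiguous documents rather than documents of a completely unknown type.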

I hope this helps you in some way. As I said, and as Christoph mentioned, going with features on those documents will be a wild attempt.

The imports:

import cv2
import pytesseract
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import os
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"