Tags: python, opencv, augmented-reality, object-detection, homography

Improve homography estimation between two frames of a video for an AR application with Python


I am supposed to improve an AR application written in Python with the OpenCV library, using a frame-to-frame comparison. The task is to project an image onto a book cover that has to be detected in an existing video.

The idea was to compute a homography between two successive frames and use it to keep the homography between the first and the current frame up to date, so that the AR layer can be projected. I ran into issues with the estimation of this homography: it seems to accumulate errors at every update, probably because of the matrix multiplication that is repeated once per frame comparison. The result in the output video is an increasingly wrong positioning of the AR layer.
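
In other words, if M_ar is the homography from the AR layer to the first frame and H_i is the frame-to-frame homography estimated at step i, the matrix I use to warp the layer onto frame n is roughly

M_n = H_n * H_(n-1) * ... * H_1 * M_ar

so any small error in a single H_i stays in the product for every following frame.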

How can I fix my problem while keeping the frame-to-frame approach?

Here is the relevant part of the code:


[...]

#################################

img_array = []
success = True
success,img_trainCOLOUR = vid.read()
kp_query=kp_ref
des_query=des_ref

#get shapes of images
h,w = img_ref.shape[:2]
h_t, w_t = img_trainCOLOUR.shape[:2]
M_mask = np.identity(3, dtype='float64')
M_prev=M_ar

#performing iterations until last frame
while success :

    #obtain grayscale image of the current RGB frame
    img_train = cv2.cvtColor(img_trainCOLOUR, cv2.COLOR_BGR2GRAY)

    # Implementing the object detection pipeline
    # F2F method: correspondences between the previous video frame and the current frame
    kp_train = sift.detect(img_train)
    kp_train, des_train = sift.compute(img_train, kp_train)
    
    #find matches 
    flann = cv2.FlannBasedMatcher(index_params, search_params)
    matches = flann.knnMatch(des_query,des_train,k=2)
    
    #validating matches
    good = []
    for m ,n in matches:
        if m.distance < 0.7*n.distance:
            good.append(m)

    #checking if we found the object
    MIN_MATCH_COUNT = 10
    if len(good)>MIN_MATCH_COUNT: 

        #differentiate between source points and destination points
        src_pts = np.float32([ kp_query[m.queryIdx].pt for m in good ]).reshape(-1,1,2)
        dst_pts = np.float32([ kp_train[m.trainIdx].pt for m in good ]).reshape(-1,1,2)
    
        #find homography between current and previous video frames
        M1, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
        #matchesMask = mask.ravel().tolist()

        #update M_mask, which holds the homography from the first frame to the previous frame,
        # with the current frame-to-frame homography
        M_mask=np.dot(M_mask,M1)
        #update M_prev, which maps the img_ar layer onto the previous frame,
        # with the current frame-to-frame homography to map it onto the current frame
        M = np.dot(M1, M_prev)
        
        #warp the img_ar layer (registered to the first frame) onto the current frame
        warped = cv2.warpPerspective(img_arCOLOUR, M, (w_t, h_t), flags= cv2.INTER_LINEAR)
        warp_mask = cv2.warpPerspective(img_armask, M, (w_t, h_t), flags= cv2.INTER_LINEAR)
        
        #restore previous values of the train images where the mask is black
        warp_mask = np.equal(warp_mask, 0)
        warped[warp_mask] = img_trainCOLOUR[warp_mask]
         
        #inserting the frames into the frame array in order to reconstruct video sequence
        img_array.append(warped)
                   
        #save current homography for the successive iteration
        M_prev = M
        #save the current frame for the successive iteration
        img_query=img_train

        #warp the mask of the book cover into the current frame
        img_maskTrans = cv2.warpPerspective(img_mask, M_mask, (w_t, h_t), flags= cv2.INTER_NEAREST)

        #new sift object detection with the current frame and the current mask 
        # to search only the book cover into the next frame      
        kp_query=sift.detect(img_query,img_maskTrans)
        kp_query, des_query = sift.compute(img_query, kp_query)

        #reading next frame for the successive iteration
        success,img_trainCOLOUR = vid.read()

[...]

here there are input data, full code and output: https://drive.google.com/drive/folders/1EAI7wYVFy7SbNZs8Cet7fWEfK2usw-y1?usp=sharing

Thanks for the support


Solution

  • Your solution drifts because you always match to the previous image rather than to a fixed reference one. Keep one of the images constant. Also, SIFT or any other descriptor-based matching method is overkill for short-baseline tracking: you can simply detect interest points (Shi-Tomasi goodFeaturesToTrack or Harris corners) and track them with Lucas-Kanade, as in the sketch below.
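
A rough, untested sketch of that idea, reusing the (hypothetical) names from your code (vid, img_mask, img_arCOLOUR, M_ar) and assuming, as the identity initialisation of M_mask suggests, that img_mask is given in the coordinate frame of the first video frame:

import cv2
import numpy as np

#corners are detected once in the first frame (inside the book cover mask) and
#tracked with Lucas-Kanade; the homography is always estimated against the
#fixed first frame, so there is no chain of matrix multiplications
success, first = vid.read()
first_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)

#Shi-Tomasi corners, restricted to the book cover by the mask
p0 = cv2.goodFeaturesToTrack(first_gray, maxCorners=500, qualityLevel=0.01,
                             minDistance=7, mask=img_mask)
p_first = p0.copy()    #reference coordinates (first frame)
p_prev = p0.copy()     #current positions of the same corners
prev_gray = first_gray

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

success, frame = vid.read()
while success:
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    #Lucas-Kanade tracking from the previous frame to the current one
    p_cur, st, err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p_prev, None, **lk_params)
    st = st.reshape(-1) == 1
    p_cur, p_first = p_cur[st], p_first[st]    #drop the points that were lost

    if len(p_cur) > 10:
        #homography from the fixed first frame to the current frame
        H, inliers = cv2.findHomography(p_first, p_cur, cv2.RANSAC, 5.0)
        #AR layer -> first frame -> current frame
        M = np.dot(H, M_ar)
        h_t, w_t = frame.shape[:2]
        warped = cv2.warpPerspective(img_arCOLOUR, M, (w_t, h_t))
        #...composite warped onto frame using the warped AR mask, as in your code

    prev_gray, p_prev = cur_gray, p_cur.reshape(-1, 1, 2)
    success, frame = vid.read()

Because findHomography always relates the fixed first-frame coordinates to the tracked positions, per-frame errors are no longer multiplied together; if too many corners are lost, they can be re-detected with goodFeaturesToTrack inside the current warped mask.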