python opencv computer-vision object-detection yolo

Object detection model for detecting rectangular shape (text cursor) in a video?

I'm currently doing some research to detect and locate a text cursor (the blinking rectangle that indicates the character position when you type) in a screen-recording video. To do that, I've trained a YOLOv4 model on a custom object dataset (I took a reference from here), and I'm also planning to implement DeepSORT to track the moving cursor.

Here's the example of training data I used to train YOLOv4:

Here's what I want to achieve:

Do you think using YOLOv4 + DeepSORT is overkill for this task? I'm asking because, as of now, the model successfully detects the text cursor in only 70%-80% of the video frames that contain it. If it is overkill after all, do you know of any other method that could be used for this task?

Also, I'm planning to detect the text cursor not only in a Visual Studio Code window, but also in a browser (e.g., Google Chrome) and a word processor (e.g., Microsoft Word). Something like this:

I'm considering the sliding window method as an alternative, but from what I've read, it might consume a lot of resources and run slower. I'm also considering template matching from OpenCV (like this), but I don't think it will perform better or faster than YOLOv4.

The constraints are processing speed (i.e., how many frames can be processed in a given amount of time) and detection accuracy (i.e., I want to avoid the letter 'l' or the digit '1' being detected as the text cursor, since those characters look similar to it in some fonts). Higher accuracy with a lower FPS is acceptable, I think.

I'm currently using Python, TensorFlow, and OpenCV for this. Thank you very much!

Solution

This would work if the cursor is the only moving object on the screen. Here is the before and after:

Before:

After:

The code:

import cv2
import numpy as np

BOX_WIDTH = 10
BOX_HEIGHT = 20

def process_img(img):
    # Convert to grayscale, then keep only the edges.
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_canny = cv2.Canny(img_gray, 50, 50)
    return img_canny

def get_contour(img):
    contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    if contours:
        return max(contours, key=cv2.contourArea)

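def get_line_tip(cnt1, cnt2):
    # No movement between the two frames: there is no contour to measure.
    if cnt1 is None:
        return None

    x1, y1, w1, h1 = cv2.boundingRect(cnt1)

    # Only accept contours more than half as tall as the cursor box.
    if h1 > BOX_HEIGHT / 2:
        if np.any(cnt2):
            x2, y2, w2, h2 = cv2.boundingRect(cnt2)
            # Moving left: the tip is the left edge of the contour.
            if x1 < x2:
                return x1, y1
        # Otherwise the tip is the right edge of the contour.
        return x1 + w1, y1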

def get_rect(x, y):
    half_width = BOX_WIDTH // 2
    lift_height = BOX_HEIGHT // 6
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)

cap = cv2.VideoCapture("screen_record.mkv")
success, img_past = cap.read()

cnt_past = np.array([])
line_tip_past = 0, 0

while True:
    success, img_live = cap.read()

    if not success:
        break

    img_live_processed = process_img(img_live)
    img_past_processed = process_img(img_past)

    img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
    cnt = get_contour(img_diff)

    line_tip = get_line_tip(cnt, cnt_past)

    if line_tip:
        cnt_past = cnt
        line_tip_past = line_tip
    else:
        line_tip = line_tip_past

    rect = get_rect(*line_tip)
    img_past = img_live.copy()
    cv2.rectangle(img_live, *rect, (0, 0, 255), 2)

    cv2.imshow("Cursor", img_live)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
    
cap.release()
cv2.destroyAllWindows()

Breaking it down:

Import the necessary libraries:

import cv2
import numpy as np

Define the size of the tracking box depending on the size of the cursor:

BOX_WIDTH = 10
BOX_HEIGHT = 20

Define a function to process the frames into edges:

def process_img(img):
    # Convert to grayscale, then keep only the edges.
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_canny = cv2.Canny(img_gray, 50, 50)
    return img_canny

Define a function that retrieves the contour with the greatest area in an image, or returns None if the image has no contours (the cursor doesn't need to be large for this to work; it can be tiny if needed):

def get_contour(img):
    contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    if contours:
        return max(contours, key=cv2.contourArea)

Define a function that takes two contours: the contour found for the current frame (the cursor plus some stray text) and the contour saved from an earlier frame (again, the cursor plus some stray text). By comparing the two contours, we can tell whether the cursor is moving left or right:

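def get_line_tip(cnt1, cnt2):
    # No movement between the two frames: there is no contour to measure.
    if cnt1 is None:
        return None

    x1, y1, w1, h1 = cv2.boundingRect(cnt1)

    # Only accept contours more than half as tall as the cursor box.
    if h1 > BOX_HEIGHT / 2:
        if np.any(cnt2):
            x2, y2, w2, h2 = cv2.boundingRect(cnt2)
            # Moving left: the tip is the left edge of the contour.
            if x1 < x2:
                return x1, y1
        # Otherwise the tip is the right edge of the contour.
        return x1 + w1, y1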

Define a function that takes the tip point of the cursor and returns a box based on the BOX_WIDTH and BOX_HEIGHT constants defined earlier:

def get_rect(x, y):
    half_width = BOX_WIDTH // 2
    lift_height = BOX_HEIGHT // 6
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)
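As a worked example of the arithmetic, take a tip at (100, 50), a point chosen arbitrarily for illustration. The box is centered horizontally on the tip and starts slightly above it, by one sixth of the box height.

```python
BOX_WIDTH = 10
BOX_HEIGHT = 20

def get_rect(x, y):
    half_width = BOX_WIDTH // 2    # 5
    lift_height = BOX_HEIGHT // 6  # 3
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)

top_left, bottom_right = get_rect(100, 50)
print(top_left, bottom_right)  # (95, 47) (105, 67)
```

The two points are the corners that `cv2.rectangle` expects, which is why the main loop can unpack the result directly with `*rect`.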

Define a capture device for the video, then read the first frame and store it in a variable that will serve as the previous frame for the first iteration. Also define initial values for the past contour and past line tip:

cap = cv2.VideoCapture("screen_record.mkv")
success, img_past = cap.read()

cnt_past = np.array([])
line_tip_past = 0, 0

In a while loop, read frames from the video and stop when no more frames are available. Process both the current frame and the frame before it:

while True:
    success, img_live = cap.read()
    if not success:
        break
    img_live_processed = process_img(img_live)
    img_past_processed = process_img(img_past)

With the processed frames, we can find the difference between the two frames using the cv2.bitwise_xor method, which shows where the movement is on the screen. Then, we can find the contour of the movement between the two frames using the get_contour function defined earlier:

    img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
    cnt = get_contour(img_diff)

With the contour, we can use the get_line_tip function defined earlier to find the tip of the cursor. If a tip was found, save it into the line_tip_past variable for the next iteration; if a tip was not found, we can use the past tip we saved as the current tip:

    line_tip = get_line_tip(cnt, cnt_past)

    if line_tip:
        cnt_past = cnt
        line_tip_past = line_tip
    else:
        line_tip = line_tip_past

Now we define a rectangle using the cursor tip and the get_rect function we defined earlier, and draw it onto the current frame. Before drawing, we save a copy of the frame to serve as the previous frame in the next iteration:

    rect = get_rect(*line_tip)
    img_past = img_live.copy()
    cv2.rectangle(img_live, *rect, (0, 0, 255), 2)

Finally, we display the frame:

    cv2.imshow("Cursor", img_live)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
    
cap.release()
cv2.destroyAllWindows()