python opencv computer-vision object-detection yolo

Object detection model for detecting rectangular shape (text cursor) in a video?


I'm currently doing some research on detecting and locating a text cursor (the blinking rectangle that indicates the character position when you type) in a screen-recording video. To do that, I've trained a YOLOv4 model on a custom object dataset (I took a reference from here), and I'm also planning to implement DeepSORT to track the moving cursor.

Here's an example of the training data I used to train YOLOv4:

[image: training data sample]

[image: training data sample 2]

Here's what I want to achieve:

[image]

Do you think YOLOv4 + DeepSORT is overkill for this task? I'm asking because, as of now, the model successfully detects the text cursor in only 70%-80% of the video frames that contain it. If it is overkill, do you know of any other method that could be used for this task?

By the way, I'm planning to detect the text cursor not only in a Visual Studio Code window, but also in a browser (e.g., Google Chrome) and a word processor (e.g., Microsoft Word). Something like this:

[image]

[image]

I'm considering the Sliding Window method as an alternative, but from what I've read, it might consume a lot of resources and run slower. I'm also considering Template Matching from OpenCV (like this), but I don't think it will perform better or faster than YOLOv4.

The constraints are processing speed (i.e., how many frames can be processed in a given amount of time) and detection accuracy (i.e., I want to avoid the letter 'l' or the digit '1' being detected as the text cursor, since those characters look similar in some fonts). That said, higher accuracy at a lower FPS is acceptable, I think.
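
To compare candidate methods on the speed constraint, throughput can be measured in the same way for each of them. The sketch below is a minimal example using only the standard library; `measure_fps` and `dummy_detector` are hypothetical names, and the dummy workload merely stands in for a real detector applied to real frames.

```python
import time


def measure_fps(process_frame, frames):
    """Run `process_frame` on every frame and return frames per second."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Stand-in workload: a dummy detector applied to 100 placeholder frames.
def dummy_detector(frame):
    return sum(frame)


fps = measure_fps(dummy_detector, [[0] * 1000 for _ in range(100)])
print(f"{fps:.1f} frames per second")
```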

I'm currently using Python, TensorFlow, and OpenCV for this. Thank you very much!


Solution

  • This would work if the cursor is the only moving object on the screen. Here is the before and after:

    Before:

    [image]

    After:

    [image]

    The code:

    import cv2
    import numpy as np
    
    BOX_WIDTH = 10
    BOX_HEIGHT = 20
    
    def process_img(img):
        # Convert to grayscale, then extract edges with Canny.
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img_canny = cv2.Canny(img_gray, 50, 50)
        return img_canny
    
    def get_contour(img):
        contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
        if contours:
            return max(contours, key=cv2.contourArea)
    
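    def get_line_tip(cnt1, cnt2):
        # get_contour returns None when nothing changed between the frames.
        if cnt1 is None:
            return None
        x1, y1, w1, h1 = cv2.boundingRect(cnt1)
    
        # Ignore changes that are too short to be the cursor.
        if h1 > BOX_HEIGHT / 2:
            if np.any(cnt2):
                x2, y2, w2, h2 = cv2.boundingRect(cnt2)
                # Moving left: the tip is the left edge of the change.
                if x1 < x2:
                    return x1, y1
            # Otherwise the tip is the right edge of the change.
            return x1 + w1, y1
        return None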
    
    def get_rect(x, y):
        half_width = BOX_WIDTH // 2
        lift_height = BOX_HEIGHT // 6
        return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)
    
    cap = cv2.VideoCapture("screen_record.mkv")
    success, img_past = cap.read()
    
    cnt_past = np.array([])
    line_tip_past = 0, 0
    
    while True:
        success, img_live = cap.read()
    
        if not success:
            break
    
        img_live_processed = process_img(img_live)
        img_past_processed = process_img(img_past)
    
        img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
        cnt = get_contour(img_diff)
    
        line_tip = get_line_tip(cnt, cnt_past)
    
        if line_tip:
            cnt_past = cnt
            line_tip_past = line_tip
        else:
            line_tip = line_tip_past
    
        rect = get_rect(*line_tip)
        img_past = img_live.copy()
        cv2.rectangle(img_live, *rect, (0, 0, 255), 2)
    
        cv2.imshow("Cursor", img_live)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
        
    cap.release()
    cv2.destroyAllWindows()
    

    Breaking it down:

    1. Import the necessary libraries:
    import cv2
    import numpy as np
    
    2. Define the size of the tracking box according to the size of the cursor:
    BOX_WIDTH = 10
    BOX_HEIGHT = 20
    
    3. Define a function to process the frames into edges:
    def process_img(img):
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img_canny = cv2.Canny(img_gray, 50, 50)
        return img_canny
    
    4. Define a function that retrieves the contour with the greatest area in an image (the cursor doesn't need to be large for this to work; it can be tiny if needed):
    def get_contour(img):
        contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
        if contours:
            return max(contours, key=cv2.contourArea)
    
    5. Define a function that takes in two contours: the contour of the cursor plus some stray text from the current frame, and the contour of the cursor plus some stray text from the frame before. By comparing the two contours, we can identify whether the cursor is moving left or right. If no contour was found for the current frame, the function returns None:
    def get_line_tip(cnt1, cnt2):
        if cnt1 is None:
            return None
        x1, y1, w1, h1 = cv2.boundingRect(cnt1)
    
        if h1 > BOX_HEIGHT / 2:
            if np.any(cnt2):
                x2, y2, w2, h2 = cv2.boundingRect(cnt2)
                if x1 < x2:
                    return x1, y1
            return x1 + w1, y1
        return None
    
    6. Define a function that takes in the tip point of the cursor and returns a box based on the BOX_WIDTH and BOX_HEIGHT constants defined earlier:
    def get_rect(x, y):
        half_width = BOX_WIDTH // 2
        lift_height = BOX_HEIGHT // 6
        return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)
    
    7. Define a capture device for the video, read one frame from the start of the video, and store it in a variable that serves as the previous frame for each iteration. Also define initial placeholder values for the past contour and the past line tip:
    cap = cv2.VideoCapture("screen_record.mkv")
    success, img_past = cap.read()
    
    cnt_past = np.array([])
    line_tip_past = 0, 0
    
    8. Use a while loop to read from the video. Process the current frame and the frame before it:
    while True:
        success, img_live = cap.read()
        if not success:
            break
        img_live_processed = process_img(img_live)
        img_past_processed = process_img(img_past)
    
    9. With the processed frames, we can find the difference between the two frames using the cv2.bitwise_xor method to see where the movement is on the screen. Then we can find the contour of the movement between the two frames using the get_contour function defined earlier:
        img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
        cnt = get_contour(img_diff)
    
    10. With the contour, we can use the get_line_tip function to find the tip of the cursor. If a tip was found, save it into the line_tip_past variable for the next iteration; if a tip was not found, use the past tip we saved as the current tip:
        line_tip = get_line_tip(cnt, cnt_past)
    
        if line_tip:
            cnt_past = cnt
            line_tip_past = line_tip
        else:
            line_tip = line_tip_past
    
    11. Now we define a rectangle using the cursor tip and the get_rect function we defined earlier, and draw it onto the current frame. Before drawing it, we save a copy of the frame to serve as the previous frame in the next iteration:
        rect = get_rect(*line_tip)
        img_past = img_live.copy()
        cv2.rectangle(img_live, *rect, (0, 0, 255), 2)
    
    12. Finally, we display the frame, and release the capture and close the window once the loop ends:
        cv2.imshow("Cursor", img_live)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
        
    cap.release()
    cv2.destroyAllWindows()
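
The differencing and box-placement steps can also be tried without a video file. The sketch below is a simplified illustration, not the code above: it uses NumPy only, two synthetic binary frames stand in for the Canny edge maps, and the bounding box of all changed pixels stands in for the largest contour. `changed_region` is a hypothetical helper; `get_rect` follows the same geometry as the function defined earlier.

```python
import numpy as np

BOX_WIDTH = 10
BOX_HEIGHT = 20


def changed_region(mask_prev, mask_live):
    """Return (x, y, w, h) of the pixels that differ, or None if none do."""
    diff = np.bitwise_xor(mask_prev, mask_live)
    ys, xs = np.nonzero(diff)
    if xs.size == 0:
        return None
    x, y = int(xs.min()), int(ys.min())
    return x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1


def get_rect(x, y):
    # Same geometry as get_rect in the answer's code.
    half_width = BOX_WIDTH // 2
    lift_height = BOX_HEIGHT // 6
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)


# Two synthetic binary frames: a 2x20 px cursor moves from x=50 to x=60.
prev = np.zeros((100, 200), dtype=np.uint8)
live = np.zeros((100, 200), dtype=np.uint8)
prev[40:60, 50:52] = 255
live[40:60, 60:62] = 255

x, y, w, h = changed_region(prev, live)
tip = (x + w, y)  # right edge of the change, as when typing left to right
print((x, y, w, h), tip, get_rect(*tip))
```

Because the XOR keeps both the old and the new cursor position, the changed region spans the whole movement, which is why the tip is taken from one edge of it rather than from its center.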