I'm currently doing some research to detect and locate a text-cursor (you know, the blinking rectangular shape that indicates the character position when you type on your computer) in a screen-recording video. To do that, I've trained a YOLOv4 model on a custom object dataset (I used the approach from here as a reference), and I'm planning to also implement DeepSORT to track the moving cursor.
Here's an example of the training data I used to train YOLOv4:
Here's what I want to achieve:
Do you think using YOLOv4 + DeepSORT is overkill for this task? I'm asking because, as of now, the model successfully detects the text-cursor in only 70%-80% of the video frames that contain it. If it is overkill after all, do you know of any other method that could be used for this task?
Anyway, I'm planning to detect the text-cursor not only in the Visual Studio Code window, but also in a browser (e.g., Google Chrome) and a word processor (e.g., Microsoft Word). Something like this:
I'm considering the Sliding Window method as an alternative, but from what I've read, it might consume a lot of resources and run slowly. I'm also considering Template Matching from OpenCV (like this; a minimal sketch follows), but I don't think it would perform better or faster than YOLOv4.
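To make the comparison concrete, here is a minimal sketch of the kind of OpenCV template matching I mean (the file names and the 0.9 threshold are placeholders, not my actual setup):

import cv2
import numpy as np

# Hypothetical inputs: one video frame and a cropped image of the cursor
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("cursor_template.png", cv2.IMREAD_GRAYSCALE)
h, w = template.shape

# Normalized cross-correlation; the 0.9 match threshold is an assumed value
result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
for y, x in zip(*np.where(result >= 0.9)):
    cv2.rectangle(frame, (int(x), int(y)), (int(x) + w, int(y) + h), 255, 1)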
The constraints are processing speed (i.e., how many frames can be processed in a given amount of time) and detection accuracy (i.e., I want to avoid the letter 'l' or the digit '1' being detected as the text-cursor, since those characters look similar in some fonts). That said, higher accuracy with a slower FPS is acceptable, I think.
I'm currently using Python, TensorFlow, and OpenCV for this. Thank you very much!
This would work if the cursor is the only moving object on the screen. Here is the before and after:
Before:
After:
The code:
import cv2
import numpy as np

BOX_WIDTH = 10
BOX_HEIGHT = 20

def process_img(img):
    # Convert to grayscale and detect edges, so that frame differencing
    # picks up structural changes rather than raw pixel noise
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_canny = cv2.Canny(img_gray, 50, 50)
    return img_canny

def get_contour(img):
    # Return the largest contour in the image, or None if there is none
    contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    if contours:
        return max(contours, key=cv2.contourArea)

def get_line_tip(cnt1, cnt2):
    # Find the tip of the cursor from the current and previous motion contours
    if cnt1 is None:
        return
    x1, y1, w1, h1 = cv2.boundingRect(cnt1)
    if h1 > BOX_HEIGHT / 2:
        if np.any(cnt2):
            x2, y2, w2, h2 = cv2.boundingRect(cnt2)
            if x1 < x2:
                return x1, y1
        return x1 + w1, y1

def get_rect(x, y):
    # Build a box around the tip using the BOX_WIDTH and BOX_HEIGHT constants
    half_width = BOX_WIDTH // 2
    lift_height = BOX_HEIGHT // 6
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)

cap = cv2.VideoCapture("screen_record.mkv")
success, img_past = cap.read()
cnt_past = np.array([])
line_tip_past = 0, 0

while True:
    success, img_live = cap.read()
    if not success:
        break
    img_live_processed = process_img(img_live)
    img_past_processed = process_img(img_past)
    img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
    cnt = get_contour(img_diff)
    line_tip = get_line_tip(cnt, cnt_past)
    if line_tip:
        cnt_past = cnt
        line_tip_past = line_tip
    else:
        line_tip = line_tip_past
    rect = get_rect(*line_tip)
    img_past = img_live.copy()
    cv2.rectangle(img_live, *rect, (0, 0, 255), 2)
    cv2.imshow("Cursor", img_live)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
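As noted above, this relies on the cursor being the only thing that moves. If other elements on the screen move too, one possible refinement (just a sketch, not something the code above does; min_ratio is an assumed tuning value) is to filter the motion contours by bounding-box shape before taking the largest one, since the cursor is much taller than it is wide:

import cv2

def get_cursor_contour(img, min_ratio=2.0):
    # Keep only contours whose bounding box is noticeably taller than wide,
    # then return the largest of those, or None if no candidate remains
    contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    candidates = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        if h >= min_ratio * max(w, 1):
            candidates.append(cnt)
    if candidates:
        return max(candidates, key=cv2.contourArea)

A function like this could be swapped in for get_contour above.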
Breaking it down:
First, import the necessary libraries:

import cv2
import numpy as np

Define the size of the box that will be drawn around the detected cursor tip:

BOX_WIDTH = 10
BOX_HEIGHT = 20

Define a function to process each frame into its Canny edges, so that the frame differencing later picks up structural changes rather than raw pixel noise:

def process_img(img):
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_canny = cv2.Canny(img_gray, 50, 50)
    return img_canny

Define a function that takes in a processed image and returns the contour with the greatest area, or None if no contour is found:

def get_contour(img):
    contours, hierarchies = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
    if contours:
        return max(contours, key=cv2.contourArea)
Define a function that takes in the contour of the current movement and the contour of the previous movement, and returns the coordinates of the cursor tip (the early return guards against frames where no contour was found):

def get_line_tip(cnt1, cnt2):
    if cnt1 is None:
        return
    x1, y1, w1, h1 = cv2.boundingRect(cnt1)
    if h1 > BOX_HEIGHT / 2:
        if np.any(cnt2):
            x2, y2, w2, h2 = cv2.boundingRect(cnt2)
            if x1 < x2:
                return x1, y1
        return x1 + w1, y1
Define a function that takes in the tip coordinates of the cursor and returns a box around it, using the BOX_WIDTH and BOX_HEIGHT constants defined earlier:

def get_rect(x, y):
    half_width = BOX_WIDTH // 2
    lift_height = BOX_HEIGHT // 6
    return (x - half_width, y - lift_height), (x + half_width, y + BOX_HEIGHT - lift_height)
Define a capture device for the screen recording, read the first frame, and initialize the variables that will hold the previous contour and the previous tip:

cap = cv2.VideoCapture("screen_record.mkv")
success, img_past = cap.read()
cnt_past = np.array([])
line_tip_past = 0, 0
Define a while loop, and read from the video. Process the current frame and the frame before that frame in the video:

while True:
    success, img_live = cap.read()
    if not success:
        break
    img_live_processed = process_img(img_live)
    img_past_processed = process_img(img_past)
With the two processed frames, we can use the cv2.bitwise_xor method to get where the movement is on the screen. Then, we can find the contour of the movement between the two frames using the get_contour function defined:

    img_diff = cv2.bitwise_xor(img_live_processed, img_past_processed)
    cnt = get_contour(img_diff)
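As a side note, the XOR works here because process_img returns binary edge maps. If you were differencing raw grayscale frames instead, the more common pattern would be cv2.absdiff followed by a threshold (a sketch; the threshold of 25 is an assumed value):

import cv2

def motion_mask(frame_a, frame_b, thresh=25):
    # Per-pixel absolute difference between two BGR frames, thresholded
    # into a binary mask of changed pixels
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray_a, gray_b)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask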
Now use the get_line_tip function defined to find the tip of the cursor. If a tip was found, save it into the line_tip_past variable to use in the next iteration, and if a tip was not found, we can use the past tip we saved as the current tip:

    line_tip = get_line_tip(cnt, cnt_past)
    if line_tip:
        cnt_past = cnt
        line_tip_past = line_tip
    else:
        line_tip = line_tip_past
Finally, get the box around the tip using the get_rect function we defined earlier, and draw it onto the current frame. But before drawing it on, we save the current frame to be the frame before the current frame of the next iteration:

    rect = get_rect(*line_tip)
    img_past = img_live.copy()
    cv2.rectangle(img_live, *rect, (0, 0, 255), 2)
    cv2.imshow("Cursor", img_live)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
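If you want to save the annotated result instead of (or in addition to) displaying it, the same loop can be wrapped with a cv2.VideoWriter (a sketch; the mp4v codec and the output file name are assumptions):

import cv2

cap = cv2.VideoCapture("screen_record.mkv")
fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if the FPS metadata is missing
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("cursor_tracked.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    success, img_live = cap.read()
    if not success:
        break
    # ... run the detection and cv2.rectangle drawing from above here ...
    out.write(img_live)

cap.release()
out.release()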