I would like to do video processing on neighboring frames. More specifically, I would like to compute the mean squared error between neighboring frames:
mean_squared_error(prev_frame, frame)
I know how to compute this in a linear, straightforward way: I use the imutils package with a queue to decouple loading the frames from processing them. By storing them in a queue, I don't need to wait for a frame before I can process it. ... but I want to be even faster...
# import the necessary packages to read the video
import imutils
from imutils.video import FileVideoStream
# package to compute the mean squared error
from skimage.metrics import mean_squared_error

if __name__ == '__main__':

    # SPECIFY PATH TO VIDEO FILE
    file = "VIDEO_PATH.mp4"

    # START IMUTILS VIDEO STREAM
    print("[INFO] starting video file thread...")
    fvs = FileVideoStream(file).start()

    # INITIALIZE LIST to store the results
    mean_square_error_list = []

    # READ PREVIOUS FRAME
    prev_frame = fvs.read()

    # LOOP over frames from the video file stream
    while fvs.more():

        # GRAB THE NEXT FRAME from the threaded video file stream
        frame = fvs.read()

        # COMPUTE the metric
        metric_val = mean_squared_error(prev_frame, frame)
        mean_square_error_list.append(1 - metric_val)  # append to list

        # UPDATE previous frame variable
        prev_frame = frame
Now my question is: how can I multiprocess the computation of the metric to increase speed and save time?
My operating system is Windows 10 and I am using Python 3.8.0.
There are many aspects to making things faster; I'll focus only on the multiprocessing part.
As you don't want to read the whole video at once, we have to read it frame by frame.
I'll be using OpenCV (cv2) to read the frames and NumPy to calculate the MSE and save it to disk.
First, we can start without any multiprocessing so we can benchmark our results. I'm using a 1920×1080 video, 60 FPS, duration 1:29, size 100 MB.
import cv2
import os
import time
import numpy as np

filename = '2.mp4'

def process_video():
    cap = cv2.VideoCapture(filename)
    proc_frames = 0
    mse = []
    prev_frame = None
    ret = True
    while ret:
        ret, frame = cap.read()  # reading frames sequentially
        if not ret:
            break
        if prev_frame is not None:
            # cast to float first: a plain uint8 subtraction would wrap around
            c_mse = np.mean(np.square(prev_frame.astype(np.float64) - frame))
            mse.append(c_mse)
        prev_frame = frame
        proc_frames += 1
    os.makedirs('data', exist_ok=True)
    np.save('data/sp.npy', np.array(mse))
    cap.release()
    return

if __name__ == "__main__":
    t1 = time.time()
    process_video()
    t2 = time.time()
    print(t2 - t1)
On my system, it runs in 142 seconds.
Now, we can take the multiprocessing approach. The idea can be summarized in the following illustration.

[GIF illustration: the video timeline split into equal segments, each processed by a separate worker in parallel. GIF credit: Google]

We make some segments (based on how many CPU cores we have) and process those segmented frames in parallel.
import cv2
import os
import time
import numpy as np
import multiprocessing as mp

filename = '2.mp4'

def process_video(group_number):
    cap = cv2.VideoCapture(filename)
    num_processes = mp.cpu_count()
    frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)
    proc_frames = 0
    mse = []
    prev_frame = None
    while proc_frames < frame_jump_unit:
        ret, frame = cap.read()
        if not ret:
            break
        if prev_frame is not None:
            # cast to float first: a plain uint8 subtraction would wrap around
            c_mse = np.mean(np.square(prev_frame.astype(np.float64) - frame))
            mse.append(c_mse)
        prev_frame = frame
        proc_frames += 1
    np.save('data/' + str(group_number) + '.npy', np.array(mse))
    cap.release()
    return
if __name__ == "__main__":
    t1 = time.time()
    num_processes = mp.cpu_count()
    print(f'CPU: {num_processes}')

    # read only the meta-data
    cap = cv2.VideoCapture(filename)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes
    cap.release()

    os.makedirs('data', exist_ok=True)
    p = mp.Pool(num_processes)
    p.map(process_video, range(num_processes))

    # merging: the MSE values missing at the segment boundaries
    # are computed here and appended in order
    final_mse = []
    for i in range(num_processes):
        na = np.load(f'data/{i}.npy')
        final_mse.extend(na)
        try:
            cap = cv2.VideoCapture(filename)  # you could also take this outside the loop to reduce some overhead
            frame_no = frame_jump_unit * (i + 1) - 1
            print(frame_no)
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            _, frame1 = cap.read()  # last frame of segment i
            _, frame2 = cap.read()  # first frame of segment i + 1
            c_mse = np.mean(np.square(frame1.astype(np.float64) - frame2))
            final_mse.append(c_mse)
            cap.release()
        except Exception:
            # after the last segment there is no frame pair left
            print('failed in 1 case')

    t2 = time.time()
    print(t2 - t1)
    np.save('data/final_mse.npy', np.array(final_mse))
I'm just using numpy.save to store the partial results; you can try something better.
This one runs in 49.56 seconds with my cpu_count() = 12. There are definitely some bottlenecks that can be avoided to make it run even faster; one of them is the round trip through disk for the partial results.
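For instance, here's a minimal sketch of that alternative: each worker returns its partial list and everything is collected from Pool.map, with no intermediate .npy files. The name process_video_part is hypothetical; its loop mirrors process_video above.

import cv2
import numpy as np
import multiprocessing as mp

filename = '2.mp4'

def process_video_part(group_number):
    # same loop as process_video above, but the partial result is
    # returned instead of written to disk
    cap = cv2.VideoCapture(filename)
    num_processes = mp.cpu_count()
    frame_jump_unit = cap.get(cv2.CAP_PROP_FRAME_COUNT) // num_processes
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)
    mse = []
    prev_frame = None
    proc_frames = 0
    while proc_frames < frame_jump_unit:
        ret, frame = cap.read()
        if not ret:
            break
        if prev_frame is not None:
            # cast to float first to avoid uint8 wrap-around
            mse.append(np.mean(np.square(prev_frame.astype(np.float64) - frame)))
        prev_frame = frame
        proc_frames += 1
    cap.release()
    return mse

if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # Pool.map preserves argument order, so concatenating the parts
        # keeps the per-segment results in frame order
        parts = pool.map(process_video_part, range(mp.cpu_count()))
    final_mse = [v for part in parts for v in part]
    # (the boundary pairs between segments still need merging, as above)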
The only issue with my implementation is that it misses the MSE for the frame pairs at the points where the video was segmented, but that's easy to add: since we can index individual frames at any location with OpenCV in O(1), we can just seek to those boundary locations, calculate the MSE for each boundary pair separately, and merge it into the final solution. [The updated code above already fixes this merging part.]
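For reference, such a random seek uses CAP_PROP_POS_FRAMES; a tiny sketch (the frame index 120 is arbitrary):

import cv2

cap = cv2.VideoCapture('2.mp4')
cap.set(cv2.CAP_PROP_POS_FRAMES, 120)  # jump straight to frame 120
ok, frame = cap.read()                 # returns frame 120; the next read() gives frame 121
cap.release()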
You can write a simple sanity check to ensure both approaches provide the same result.

import numpy as np

a = np.load('data/sp.npy')
b = np.load('data/final_mse.npy')

print(a.shape)
print(b.shape)
print(a[:10])
print(b[:10])

for i in range(len(a)):
    if a[i] != b[i]:
        print(i)  # prints nothing if the two runs agree exactly
Now, some additional speedups can come from using a CUDA-compiled OpenCV, from ffmpeg, from adding a queuing mechanism on top of the multiprocessing, etc.
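As a starting point for the queuing idea, here is a minimal sketch that decouples decoding from the MSE computation with a reader thread and a bounded queue, much like what imutils does internally. All names here are hypothetical, and it processes the video as a single stream, without the segmenting.

import cv2
import queue
import threading
import numpy as np

def read_frames(path, q):
    # producer: decode frames on a separate thread and push them into the queue
    cap = cv2.VideoCapture(path)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        q.put(frame)
    q.put(None)  # sentinel: no more frames
    cap.release()

def mse_over_video(path):
    # consumer: compute the MSE between neighboring frames as they arrive
    q = queue.Queue(maxsize=128)  # bounded, so the reader can't run arbitrarily far ahead
    t = threading.Thread(target=read_frames, args=(path, q), daemon=True)
    t.start()
    mse = []
    prev = None
    while True:
        frame = q.get()
        if frame is None:
            break
        if prev is not None:
            mse.append(np.mean(np.square(prev.astype(np.float64) - frame)))
        prev = frame
    t.join()
    return mse

if __name__ == "__main__":
    print(len(mse_over_video('2.mp4')))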