I want to blend one video on top of another using an alpha video. This is my code. It works correctly, but the problem is that it isn't efficient at all, and that's because of the /255 parts. It is slow and lags.
Is there a standard, efficient way of doing this? I want the results to be real-time. Thanks
import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')
    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255
        fr_alpha = alpha.read()[1]/255
        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))
        if cv2.waitKey(1) == ord('q'): break
    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()
Let's get a few obvious problems out of the way first -- foreground.isOpened() will return true even after you have reached the end of the video, so your program will end up crashing at that point.
The solution is twofold. First of all, test all 3 VideoCapture instances right after you create them, using something like:
if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print("Unable to open input videos.")
    return
That will make sure that all of them opened properly. The next part is to correctly handle reaching the end of the video. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.
while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
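For completeness, the frame-is-None variant would look roughly like this (an alternative to the flag check above, not something needed in addition to it):

while True:
    fr_foreground = foreground.read()[1]
    fr_background = background.read()[1]
    fr_alpha = alpha.read()[1]
    if fr_foreground is None or fr_background is None or fr_alpha is None:
        break # End of video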
Additionally, it seems you don't actually call cv2.destroyAllWindows() -- the () is missing. Not that this really matters.
To help investigate and optimize this, I've added some detailed timing, using the timeit module and a couple of convenience functions:
from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    print("Iterations: %d" % n)
    for i in range(len(total_times)):
        print("Step %d: %0.4f ms" % (i, total_times[i] / n))
    print("Total: %0.4f ms" % (np.sum(total_times) / n))
and modified the main() function to measure the time taken by each logical step -- read, scale, blend, show, waitKey. To do this, I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (/255 is interpreted as integer division and yields wrong results).
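As a quick aside, here is a minimal standalone illustration of that Python 2.x pitfall (the small array is just a stand-in for a real frame):

import numpy as np

frame = np.array([0, 128, 255], np.uint8)  # stand-in for a frame returned by read()
print(frame / 255)    # Python 2: integer division -> [0 0 1]
print(frame / 255.0)  # true division -> proper values in the 0.0-1.0 range

With that out of the way, the instrumented loop looks like this: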
times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)
When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6 core machine. The timing of the sections is as follows:
Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms
This suggests that the biggest bottlenecks are both the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.
Let's look at the data types of the numpy arrays we use. read() gives us arrays with dtype of np.uint8 -- 8-bit unsigned integers. However, the floating point division (as written) will yield an array with dtype of np.float64 -- 64-bit floating point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32-bit floats -- it would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.
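To make that promotion concrete, here is a quick standalone check (the shape is just an example matching the 1280x800 videos):

import numpy as np

frame = np.zeros((800, 1280, 3), np.uint8)   # same dtype/shape as a frame from read()
print((frame / 255.0).dtype)                 # float64 -- more precision (and memory traffic) than we need
print((frame / np.float32(255.0)).dtype)     # float32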
There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us the result with the same dtype:
fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)
Which gives us the following timings:
Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms
Or we could cast the array to np.float32 first, and then do the scaling in-place:
fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)
fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0
Which gives the following timings (splitting step 1 into conversion (1) and scaling (2) -- rest shifts by 1):
Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms
Both are roughly equivalent, running at ~60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.
From the previous timings, we can see that the scaling is no longer the bottleneck, but an idea still comes to mind -- division is generally slower than multiplication, so what if we multiplied by a reciprocal?
fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0
Indeed this does gain us a millisecond -- nothing spectacular, but it was easy, so we might as well go with it:
Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms
Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:
foreground * alpha + background * (1.0 - alpha)
we can observe that for the math to work, the only value that needs to be in the range (0.0, 1.0) is alpha.
What if we scaled only the alpha image? Also, since multiplication by a floating point value will promote to floating point, what if we also skipped the type conversion? That would mean cmb() would have to return an np.uint8 array:
def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))
and we would have
#fr_foreground = np.float32(fr_foreground)
#fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)
#fr_foreground *= 1/255.0
#fr_background *= 1/255.0
fr_alpha *= 1/255.0
The times for this are:
Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms
Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we're avoiding some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems a win -- we're at 50% of the original time.
To continue, let's get rid of the cmb() function, move its functionality to main() and split it up to measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):
times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1
New timings:
Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms
We didn't really gain anything, but the reads got noticeably faster.
This leads to another idea -- what if we tried to minimize allocations and reuse the arrays in subsequent iterations?
We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames:
if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)
Now we can use:
- numpy.add for addition
- numpy.subtract for subtraction
- numpy.multiply for multiplication
- numpy.copyto for type conversion

We can also merge steps 1 and 2 together, using a single numpy.multiply.
times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)
    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1
This gives us the following timings:
Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms
Significant improvement in all the steps we modified. We're down to ~35% of the time needed by the original implementation.
Minor update:
Based on Silencer's answer I measured cv2.convertScaleAbs as well. It actually runs a bit faster:
Step 6: 1.2318 ms
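For reference, the convertScaleAbs variant slots in where the np.copyto() call was -- roughly:

cv2.convertScaleAbs(sum, result)  # saturating cast of the float32 sum into the pre-allocated uint8 result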
That gave me another idea -- we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. This would allow us to combine steps 5 and 6 together.
cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)
which comes out at
Step 5: 3.3621 ms
Again a little win (previously we were around 3.9ms).
Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define an output data type for multiplication.
cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)
Timings:
Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms
This seems to be about as far as we can get without some parallelization. We're already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.
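As a rough, illustrative starting point (not the final design), the sketch below shows one way the read stage could be split off into a producer thread that feeds the blending loop through a queue. The names reader and frame_queue are mine, purely for illustration, and it assumes the three VideoCapture objects from above already exist; since the OpenCV bindings release the GIL inside native calls like read(), the decoding should be able to overlap with the blending math in the main thread.

import threading
try:
    import Queue as queue  # Python 2
except ImportError:
    import queue  # Python 3

frame_queue = queue.Queue(maxsize=4)  # small buffer between the reader and the blender

def reader():
    # Producer: read all three streams and hand off raw frame triplets.
    while True:
        r_fg, fr_fg = foreground.read()
        r_bg, fr_bg = background.read()
        r_a, fr_a = alpha.read()
        if not r_fg or not r_bg or not r_a:
            frame_queue.put(None)  # sentinel marking the end of the videos
            return
        frame_queue.put((fr_fg, fr_bg, fr_a))

threading.Thread(target=reader).start()

while True:
    item = frame_queue.get()
    if item is None:
        break
    fr_foreground, fr_background, fr_alpha_raw = item
    # ... pre-allocation, scaling, blending and imshow from the loop above go here ...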
To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, as well as the inter-dependencies between the calculations:
WIP -- see comments for additional info while I write this up.