Tags: python, opencv, lag, alpha-blending

Alpha Blending using OpenCV for videos


I want to blend a video on top of another one using an alpha video. This is my code. It works, but it isn't efficient, and that's because of the /255 parts. It is slow and has a lagging problem.

Is there a standard and efficient way of doing this? I want the results to be real-time. Thanks

import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')

    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255     
        fr_alpha = alpha.read()[1]/255

        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))

        if cv2.waitKey(1) == ord('q'): break

    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()

Solution

  • Let's get a few obvious problems out of the way first - foreground.isOpened() will return true even after you have reached the end of the video, so your program will end up crashing at that point. The solution is twofold. First of all, test all 3 VideoCapture instances right after you create them, using something like:

    if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
        print("Unable to open input videos.")
        return
    

    That will make sure that all of them opened properly. The next part is to correctly handle reaching the end of the video. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.

    while True:
        r_fg, fr_foreground = foreground.read()
        r_bg, fr_background = background.read()
        r_a, fr_alpha = alpha.read()
        if not r_fg or not r_bg or not r_a:
            break # End of video
    

    Additionally, it seems you don't actually call cv2.destroyAllWindows() -- the () is missing. Not that this really matters.


    To help investigate and optimize this, I've added some detailed timing, using the timeit module and a couple of convenience functions:

    from timeit import default_timer as timer
    
    def update_times(times, total_times):
        for i in range(len(times) - 1):
            total_times[i] += (times[i+1]-times[i]) * 1000
    
    def print_times(total_times, n):
        print("Iterations: %d" % n)
        for i in range(len(total_times)):
            print("Step %d: %0.4f ms" % (i, total_times[i] / n))
        print("Total: %0.4f ms" % (np.sum(total_times) / n))
    

    and modified the main() function to measure the time taken by each logical step -- read, scale, blend, show, waitKey. To do this, I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (there, /255 is interpreted as integer division and yields wrong results).
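
    As a quick illustration of that Python 2 pitfall (a hypothetical REPL session, without from __future__ import division):

    >>> np.uint8([128, 255]) / 255
    array([0, 1], dtype=uint8)         # integer division -- wrong
    >>> np.uint8([128, 255]) / 255.0
    array([ 0.50196078,  1.        ])  # true division, but the dtype is float64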

    times = [0.0] * 6
    total_times = [0.0] * (len(times) - 1)
    n = 0
    while True:
        times[0] = timer()
        r_fg, fr_foreground = foreground.read()
        r_bg, fr_background = background.read()
        r_a, fr_alpha = alpha.read()
        if not r_fg or not r_bg or not r_a:
            break # End of video
        times[1] = timer()
        fr_foreground = fr_foreground / 255.0
        fr_background = fr_background / 255.0
        fr_alpha = fr_alpha / 255.0
        times[2] = timer()
        result = cmb(fr_foreground,fr_background,fr_alpha)
        times[3] = timer()
        cv2.imshow('My Image', result)
        times[4] = timer()
        if cv2.waitKey(1) == ord('q'): break
        times[5] = timer()
        update_times(times, total_times)
        n += 1
    
    print_times(total_times, n)
    

    When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6-core machine. The timing of the sections is as follows:

    Iterations: 1190
    Step 0: 11.4385 ms
    Step 1: 37.1320 ms
    Step 2: 39.4083 ms
    Step 3: 2.5488 ms
    Step 4: 10.7083 ms
    Total: 101.2358 ms
    

    This suggests that the biggest bottlenecks are the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.


    Let's look at the data types of the numpy arrays we use. read() gives us arrays with a dtype of np.uint8 -- 8-bit unsigned integers. However, the floating point division (as written) will yield an array with a dtype of np.float64 -- 64-bit floating point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32-bit floats -- it would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.

    There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us the result with the same dtype:

    fr_foreground = fr_foreground / np.float32(255.0)
    fr_background = fr_background / np.float32(255.0)
    fr_alpha = fr_alpha / np.float32(255.0)
    

    Which gives us the following timings:

    Iterations: 1786
    Step 0: 9.2550 ms
    Step 1: 19.0144 ms
    Step 2: 21.2120 ms
    Step 3: 1.4662 ms
    Step 4: 10.8889 ms
    Total: 61.8365 ms
    

    Or we could cast the arrays to np.float32 first, and then do the scaling in-place:

    fr_foreground = np.float32(fr_foreground)
    fr_background = np.float32(fr_background)
    fr_alpha = np.float32(fr_alpha)
    
    fr_foreground /= 255.0
    fr_background /= 255.0
    fr_alpha /= 255.0
    

    Which gives the following timings (splitting step 1 into conversion (1) and scaling (2) -- the rest shift down by one):

    Iterations: 1786
    Step 0: 9.0589 ms
    Step 1: 13.9614 ms
    Step 2: 4.5960 ms
    Step 3: 20.9279 ms
    Step 4: 1.4631 ms
    Step 5: 10.4396 ms
    Total: 60.4469 ms
    

    Both are roughly equivalent, running at ~60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.


    From the previous timings, we can see that the scaling is no longer the bottleneck, but an idea still comes to mind -- division is generally slower than multiplication, so what if we multiplied by a reciprocal?

    fr_foreground *= 1/255.0
    fr_background *= 1/255.0
    fr_alpha *= 1/255.0
    

    Indeed this does gain us a millisecond -- nothing spectacular, but it was easy, so we might as well go with it:

    Iterations: 1786
    Step 0: 9.1843 ms
    Step 1: 14.2349 ms
    Step 2: 3.5752 ms
    Step 3: 21.0545 ms
    Step 4: 1.4692 ms
    Step 5: 10.6917 ms
    Total: 60.2097 ms
    

    Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:

    foreground * alpha + background * (1.0 - alpha)
    

    we can observe that, for the math to work out, the only value that needs to be in the range [0.0, 1.0] is alpha.

    What if we scaled only the alpha image? Also, since multiplication by a floating point value promotes the result to floating point, what if we also skipped the type conversion of the other two arrays? That would mean cmb() would have to return an np.uint8 array:

    def cmb(fg,bg,a):
        return np.uint8(fg * a + bg * (1-a))
    

    and we would have

        #fr_foreground = np.float32(fr_foreground)
        #fr_background = np.float32(fr_background)
        fr_alpha = np.float32(fr_alpha)
    
        #fr_foreground *= 1/255.0
        #fr_background *= 1/255.0
        fr_alpha *= 1/255.0
    

    The times for this are

    Step 0: 7.7023 ms
    Step 1: 4.6758 ms
    Step 2: 1.1061 ms
    Step 3: 27.3188 ms
    Step 4: 0.4783 ms
    Step 5: 9.0027 ms
    Total: 50.2840 ms
    

    Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we're avoiding some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems like a win -- we're at 50% of the original time.


    To continue, let's get rid of the cmb() function, move its functionality into main(), and split it up so we can measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):

    times = [0.0] * 11
    total_times = [0.0] * (len(times) - 1)
    n = 0
    while True:
        times[0] = timer()
        r_fg, fr_foreground = foreground.read()
        r_bg, fr_background = background.read()
        r_a, fr_alpha_raw = alpha.read()
        if not r_fg or not r_bg or not r_a:
            break # End of video
    
        times[1] = timer()
        fr_alpha = np.float32(fr_alpha_raw)
        times[2] = timer()
        fr_alpha *= 1/255.0
        times[3] = timer()
        fr_alpha_inv = 1.0 - fr_alpha
        times[4] = timer()
        fr_fg_weighed = fr_foreground * fr_alpha
        times[5] = timer()
        fr_bg_weighed = fr_background * fr_alpha_inv
        times[6] = timer()
        sum = fr_fg_weighed + fr_bg_weighed
        times[7] = timer()
        result = np.uint8(sum)
        times[8] = timer()
        cv2.imshow('My Image', result)
        times[9] = timer()
        if cv2.waitKey(1) == ord('q'): break
        times[10] = timer()
        update_times(times, total_times)
        n += 1
    

    New timings:

    Iterations: 1786
    Step 0: 6.8733 ms
    Step 1: 5.2742 ms
    Step 2: 1.1430 ms
    Step 3: 4.5800 ms
    Step 4: 7.0372 ms
    Step 5: 7.0675 ms
    Step 6: 5.3082 ms
    Step 7: 2.6912 ms
    Step 8: 0.4658 ms
    Step 9: 9.6966 ms
    Total: 50.1372 ms
    

    We didn't really gain anything, but the reads got noticeably faster.


    This leads to another idea -- what if we tried to minimize allocations and reuse the arrays in subsequent iterations?

    We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames:

    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)
    

    Now we can use the output-argument forms of the numpy operations (np.multiply, np.subtract, np.add and np.copyto) to write the results directly into those pre-allocated arrays instead of allocating new ones. We can also merge steps 1 and 2 together, using a single numpy.multiply.

    times = [0.0] * 10
    total_times = [0.0] * (len(times) - 1)
    n = 0
    while True:
        times[0] = timer()
        r_fg, fr_foreground = foreground.read()
        r_bg, fr_background = background.read()
        r_a, fr_alpha_raw = alpha.read()
        if not r_fg or not r_bg or not r_a:
            break # End of video
    
        if n == 0: # Pre-allocate
            fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
            fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
            fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
            fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
            sum = np.zeros_like(fr_alpha_raw, np.float32)
            result = np.zeros_like(fr_alpha_raw, np.uint8)
    
        times[1] = timer()
        np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
        times[2] = timer()
        np.subtract(1.0, fr_alpha, fr_alpha_inv)
        times[3] = timer()
        np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
        times[4] = timer()
        np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
        times[5] = timer()
        np.add(fr_fg_weighed, fr_bg_weighed, sum)
        times[6] = timer()
        np.copyto(result, sum, 'unsafe')
        times[7] = timer()
        cv2.imshow('My Image', result)
        times[8] = timer()
        if cv2.waitKey(1) == ord('q'): break
        times[9] = timer()
        update_times(times, total_times)
        n += 1
    

    This gives us the following timings:

    Iterations: 1786
    Step 0: 7.0515 ms
    Step 1: 3.8839 ms
    Step 2: 1.9080 ms
    Step 3: 4.5198 ms
    Step 4: 4.3871 ms
    Step 5: 2.7576 ms
    Step 6: 1.9273 ms
    Step 7: 0.4382 ms
    Step 8: 7.2340 ms
    Total: 34.1074 ms
    

    Significant improvement in all the steps we modified. We're down to ~35% of the time needed by the original implementation.


    Minor update:

    Based on Silencer's answer, I measured cv2.convertScaleAbs as well. It actually runs a bit faster:

    Step 6: 1.2318 ms
    
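
    For reference, the replacement for step 6 would presumably look something like this (a sketch; cv2.convertScaleAbs does a saturating conversion to np.uint8 and can write into the pre-allocated result):

    cv2.convertScaleAbs(sum, result)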

    That gave me another idea -- we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. This would allow us to combine steps 5 and 6 together.

    cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)
    

    which comes out at

    Step 5: 3.3621 ms
    

    Again a little win (previously we were around 3.9ms).

    Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define an output data type for the multiplications.

        cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
        cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
        cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)
    

    Timings:

    Step 2: 2.1897 ms
    Step 3: 2.8981 ms
    Step 4: 2.9066 ms
    
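
    Putting it all together, the hot loop with every single-threaded optimization applied would look roughly like this (a sketch assembled from the snippets above, with the timing instrumentation omitted):

    n = 0
    while True:
        r_fg, fr_foreground = foreground.read()
        r_bg, fr_background = background.read()
        r_a, fr_alpha_raw = alpha.read()
        if not r_fg or not r_bg or not r_a:
            break  # End of video

        if n == 0:  # Pre-allocate once, reuse on every iteration
            fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
            fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
            fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
            fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
            result = np.zeros_like(fr_alpha_raw, np.uint8)

        # Scale only the alpha frame to [0.0, 1.0]; blend with saturating output
        np.multiply(fr_alpha_raw, np.float32(1 / 255.0), fr_alpha)
        cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
        cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
        cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)
        cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)

        cv2.imshow('My Image', result)
        if cv2.waitKey(1) == ord('q'):
            break
        n += 1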

    This seems to be about as far as we can get without some parallelization. We're already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.

    To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, as well as the inter-dependencies between the calculations:

    [Chart: the individual operations, their best measured times, and the dependencies between them -- image not reproduced]
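
    As a rough sketch of where this could head (an illustration assuming a simple two-stage split, not the final write-up), reading can be decoupled from blending with a worker thread and a bounded queue; the actual partitioning into stages would follow the dependencies in the chart:

    import threading
    try:
        import queue              # Python 3
    except ImportError:
        import Queue as queue     # Python 2

    def read_frames(foreground, background, alpha, frames):
        # Stage 1: read frame triplets and hand them off to the blending stage.
        while True:
            r_fg, fr_fg = foreground.read()
            r_bg, fr_bg = background.read()
            r_a, fr_a = alpha.read()
            if not (r_fg and r_bg and r_a):
                frames.put(None)  # Signal end of video
                return
            frames.put((fr_fg, fr_bg, fr_a))

    frames = queue.Queue(maxsize=4)  # Bounded, so the reader can't run far ahead
    reader = threading.Thread(target=read_frames,
                              args=(foreground, background, alpha, frames))
    reader.daemon = True  # Don't block interpreter exit if we quit early
    reader.start()

    # Stage 2: blend and display (HighGUI calls kept on the main thread).
    while True:
        item = frames.get()
        if item is None:
            break  # End of video
        fr_foreground, fr_background, fr_alpha_raw = item
        fr_alpha = np.float32(fr_alpha_raw) * np.float32(1 / 255.0)
        result = np.uint8(fr_foreground * fr_alpha + fr_background * (1.0 - fr_alpha))
        cv2.imshow('My Image', result)
        if cv2.waitKey(1) == ord('q'):
            break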

    WIP -- see comments for additional info while I write this up.