I want to blend one video on top of another using an alpha video. This is my code. It works correctly, but the problem is that it isn't efficient at all, and that's because of the /255 parts. It is slow and lags.
Is there a standard, efficient way of doing this? I want the results to be real-time. Thanks
import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')
    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255
        fr_alpha = alpha.read()[1]/255
        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))
        if cv2.waitKey(1) == ord('q'): break
    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()
Let's get a few obvious problems out of the way first -- foreground.isOpened() will return true even after you have reached the end of the video, so your program will end up crashing at that point.
The solution is twofold. First of all, test all 3 VideoCapture instances right after you create them, using something like:
if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print("Unable to open input videos.")
    return
That will make sure that all of them opened properly. The next part is to correctly handle reaching the end of the video. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.
while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
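For completeness, the frame-is-None variant would look roughly like this (an alternative to the flag check above, not something needed in addition to it):

while True:
    fr_foreground = foreground.read()[1]
    fr_background = background.read()[1]
    fr_alpha = alpha.read()[1]
    if fr_foreground is None or fr_background is None or fr_alpha is None:
        break # End of video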
Additionally, it seems you don't actually call cv2.destroyAllWindows() -- the () is missing. Not that this really matters.
To help investigate and optimize this, I've added some detailed timing, using the timeit module and a couple of convenience functions:
from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    print("Iterations: %d" % n)
    for i in range(len(total_times)):
        print("Step %d: %0.4f ms" % (i, total_times[i] / n))
    print("Total: %0.4f ms" % (np.sum(total_times) / n))
and modified the main() function to measure the time taken by each logical step -- read, scale, blend, show, waitKey. To do this, I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (/255 is interpreted as integer division and yields wrong results).
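As a quick aside, here is a minimal standalone illustration of that Python 2.x pitfall (the small array is just a stand-in for a real frame):

import numpy as np

frame = np.array([0, 128, 255], np.uint8)  # stand-in for a frame returned by read()
print(frame / 255)    # Python 2: integer division -> [0 0 1]
print(frame / 255.0)  # true division -> proper values in the 0.0-1.0 range

With that out of the way, the instrumented loop looks like this: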
times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)
When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6 core machine. The timing of the sections is as follows:
Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms
This suggests that the biggest bottlenecks are both the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.
Let's look at the data types of the numpy arrays we use. read() gives us arrays with dtype of np.uint8 -- 8-bit unsigned integers. However, the floating point division (as written) will yield an array with dtype of np.float64 -- 64-bit floating point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32-bit floats -- it would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.
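To make that promotion concrete, here is a quick standalone check (the shape is just an example matching the 1280x800 videos):

import numpy as np

frame = np.zeros((800, 1280, 3), np.uint8)   # same dtype/shape as a frame from read()
print((frame / 255.0).dtype)                 # float64 -- more precision (and memory traffic) than we need
print((frame / np.float32(255.0)).dtype)     # float32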
There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us the result with the same dtype:
fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)
Which gives us the following timings:
Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms
Or we could cast the array to np.float32 first, and then do the scaling in-place:
fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)
fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0
Which gives the following timings (splitting step 1 into conversion (1) and scaling (2) -- rest shifts by 1):
Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms
Both are roughly equivalent, running at ~60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.
From the previous timings, we can see that the scaling is no longer the bottleneck, but an idea still comes to mind -- division is generally slower than multiplication, so what if we multiplied by a reciprocal?
fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0
Indeed this does gain us a millisecond -- nothing spectacular, but it was easy, so we might as well go with it:
Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms
Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:
foreground * alpha + background * (1.0 - alpha)
we can observe that for the math to work, the only value that needs to be in the range (0.0, 1.0) is alpha.
What if we scaled only the alpha image? Also, since multiplication by a floating point value will promote to floating point, what if we also skipped the type conversion? That would mean cmb() would have to return an np.uint8 array:
def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))
and we would have
#fr_foreground = np.float32(fr_foreground)
#fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)
#fr_foreground *= 1/255.0
#fr_background *= 1/255.0
fr_alpha *= 1/255.0
The times for this are:
Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms
Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we're avoiding some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems a win -- we're at 50% of the original time.
To continue, let's get rid of the cmb() function, move its functionality to main() and split it up to measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):
times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1
New timings:
Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms
We didn't really gain anything, but the reads got noticeably faster.
This leads to another idea -- what if we tried to minimize allocations and reuse the arrays in subsequent iterations?
We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames:
if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)
Now we can use:
- numpy.add for addition
- numpy.subtract for subtraction
- numpy.multiply for multiplication
- numpy.copyto for type conversion

We can also merge steps 1 and 2 together, using a single numpy.multiply.
times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)
    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1
This gives us the following timings:
Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms
Significant improvement in all the steps we modified. We're down to ~35% of the time needed by the original implementation.
Minor update:
Based on Silencer's answer I measured cv2.convertScaleAbs as well. It actually runs a bit faster:
Step 6: 1.2318 ms
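For reference, the convertScaleAbs variant slots in where the np.copyto() call was -- roughly:

cv2.convertScaleAbs(sum, result)  # saturating cast of the float32 sum into the pre-allocated uint8 result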
That gave me another idea -- we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. This would allow us to combine steps 5 and 6 together.
cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)
which comes out at
Step 5: 3.3621 ms
Again a little win (previously we were around 3.9ms).
Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define an output data type for multiplication.
cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)
Timings:
Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms
This seems to be about as far as we can get without some parallelization. We're already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.
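As a rough, illustrative starting point (not the final design), the sketch below shows one way the read stage could be split off into a producer thread that feeds the blending loop through a queue. The names reader and frame_queue are mine, purely for illustration, and it assumes the three VideoCapture objects from above already exist; since the OpenCV bindings release the GIL inside native calls like read(), the decoding should be able to overlap with the blending math in the main thread.

import threading
try:
    import Queue as queue  # Python 2
except ImportError:
    import queue  # Python 3

frame_queue = queue.Queue(maxsize=4)  # small buffer between the reader and the blender

def reader():
    # Producer: read all three streams and hand off raw frame triplets.
    while True:
        r_fg, fr_fg = foreground.read()
        r_bg, fr_bg = background.read()
        r_a, fr_a = alpha.read()
        if not r_fg or not r_bg or not r_a:
            frame_queue.put(None)  # sentinel marking the end of the videos
            return
        frame_queue.put((fr_fg, fr_bg, fr_a))

threading.Thread(target=reader).start()

while True:
    item = frame_queue.get()
    if item is None:
        break
    fr_foreground, fr_background, fr_alpha_raw = item
    # ... pre-allocation, scaling, blending and imshow from the loop above go here ...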
To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, as well as the inter-dependencies between the calculations:
WIP -- see comments for additional info while I write this up.