I have an array of 1000-2000 lists, and I need to pass each of those lists into a function. Each function call takes approximately 0.002s. I have already tried concurrent.futures.ProcessPoolExecutor and multiprocessing.Pool; each of those approaches is 1.5x-2x slower than simply iterating over the big array in a linear, single-process way. I've also tried splitting the big array of lists into several smaller arrays of lists, but that didn't get the job done any faster either, only slower. I assume the reason is the overhead of switching between processes, but theoretically, splitting the big array into chunks should have fixed that.
Is there any way to do this faster?
The lists are lines of an image (numpy arrays) with values 0 or 255 (black and white pixels).
The function detects the index ranges of runs of black pixels on a single line:
[0, 0, 0, 255, 255, 0, 255, 255,255] -> ((0, 2), (5, 5))
Code for testing (objects_on_line is the part that needs to be parallelized):
import numpy as np
from time import perf_counter
import cv2
OBJ_COLOR = 0
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield tuple(lst[i:i + n].tolist())

def objects_on_line(line):
    line_original = np.array(line)
    line = np.where(line_original == OBJ_COLOR)[0]
    mask = [0] * len(line)
    for i, el in enumerate(line[:-1]):
        if line[i+1] - el == 1 and (i+2 < len(line) and line[i+2] - el == 2):
            mask[i+1] = 1
    line = np.ma.array(line, mask=mask).compressed()
    i = 0
    for _ in line[:-1]:
        if i+1 >= len(line):
            break
        lv = line[i]
        ls = line_original[lv:line[i+1]]
        if len(np.where(ls != OBJ_COLOR)[0]) != 0:
            line = np.insert(line, i, lv)
        i += 2
    if len(line) % 2 != 0:
        line = np.insert(line, len(line), line[-1])
    line = list(chunks(line, 2))
    return line

# cv2.imread returns BGR, and cv2.threshold returns a (retval, image) tuple
img = cv2.cvtColor(cv2.imread("test_people.png"), cv2.COLOR_BGR2GRAY)
_, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

s = perf_counter()
for line in img_bin:
    objects_on_line(line)
print(f"Done in: {perf_counter()-s}")
    line_original = np.array(line)

Don't do that. Change the API you define, so there is an objects_on_lines(image: np.array, lines: tuple[int]) function. You want to put all the image pixels into a single numpy array and then leave them there. Pass in a (start, end) range of indexes for a core to chew on. (Heck, you could even pass in a range object.)
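A minimal sketch of what that API could look like, assuming the binary image is a 2-D uint8 array already in memory and each worker receives a (start, end) row range. The run detection here is done with vectorized diff/flatnonzero rather than the question's Python loops; the function name matches the suggestion above, everything else is illustrative:

```python
import numpy as np

OBJ_COLOR = 0

def objects_on_lines(image, start, end):
    """Find (first, last) index pairs of OBJ_COLOR runs in rows start..end."""
    results = []
    for row in image[start:end]:
        is_obj = (row == OBJ_COLOR)
        # Pad with False on both sides so runs touching the borders are caught.
        padded = np.concatenate(([False], is_obj, [False]))
        # Nonzero diffs mark transitions; they come in (rise, fall) pairs.
        edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
        starts = edges[0::2]
        ends = edges[1::2] - 1          # inclusive end index of each run
        results.append(tuple(zip(starts.tolist(), ends.tolist())))
    return results
```

On the question's example row this reproduces the expected ((0, 2), (5, 5)) output, and no per-row copies of the pixel data are made.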
Avoid copying into a newly created list. Just use the pixels where they are. Point at them with indexes.
It's not clear that your system will be able to achieve a 12x speedup, even if coded in Rust or C++. It might be the case that a handful of cores can max out your main memory bandwidth. Increasing the multiprogramming level beyond that, to 12 or 24, would be counterproductive.
Take care to represent pixels compactly. At least use uint8. If possible use a bit vector, for an 8x reduction in memory bandwidth requirements.
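A quick sketch of the bit-vector idea with numpy's packbits, assuming pixels are 0 or 255 in a uint8 array as in the question; one packed byte then carries 8 pixels:

```python
import numpy as np

# Toy one-row image, same values as the question's example.
img_demo = np.array([[0, 0, 0, 255, 255, 0, 255, 255, 255]], dtype=np.uint8)

bits = (img_demo != 0)                # True where the pixel is white
packed = np.packbits(bits, axis=1)    # 8 pixels per byte, MSB first
# Round-trip back to individual pixels; trim the padding bits.
unpacked = np.unpackbits(packed, axis=1)[:, :img_demo.shape[1]]
```

The 9-pixel row packs into 2 bytes, so scanning the packed form touches roughly an eighth of the memory.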
Consider staggering your offsets within the array. If all threads are hitting virtual memory addresses which conflict for the same cache line, then with 4-way associativity there's a fairly low limit on how many threads can productively contend for fresh pixels.
There's something called "the straggler effect". Your N identical cores won't necessarily be scheduled and perform identically. It's likely that at least one will complete its task after the others. So rather than breaking 2000 lists into N same-size lists, consider breaking out N "large" plus a couple of "small" lists. Begin working on the large lists initially. Then if a core is done with a large list early, it will find a small list to work on, overlapping with the work done by some straggler core.
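A sketch of that straggler-friendly work split: N large chunks sized for the core count, plus a tail of small chunks for early finishers to pick up. The function name and the size ratios are illustrative assumptions, not tuned values:

```python
def make_chunks(n_rows, n_cores, small_frac=0.2):
    """Return (start, end) row ranges: n_cores large chunks, then small ones."""
    big_rows = int(n_rows * (1 - small_frac))
    big = big_rows // n_cores            # size of each large chunk
    small = max(1, big // 4)             # size of each small tail chunk
    bounds, start = [], 0
    for _ in range(n_cores):
        bounds.append((start, start + big))
        start += big
    while start < n_rows:                # leftover rows in small pieces
        bounds.append((start, min(start + small, n_rows)))
        start += small
    return bounds
```

Submitting these ranges to a pool in order means the large chunks start first, and any core that finishes early grabs a small chunk while a straggler is still working.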
Table lookups may help to speed your edge detection. Suppose each raster of bits is encoded in uint8 bytes, 2 bits at a time. Now there are four cases:

- 00, no object
- 01, found the left edge
- 10, found the right edge
- 11, all object

Combine a bunch of 2-bit results to find the result for a whole raster.
More usefully, you might store more than 2 bits per array element. Four bits leads to a 16-entry lookup table. You might want to filter salt & pepper noise from entries, e.g. viewing 0101 (half object?) as 0111 (mostly object!). A more efficient approach of 8 pixels per byte leads to a 256-entry lookup table.
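A sketch of that 256-entry table: for each possible packed byte, precompute the within-byte (start, end) positions of object runs. Here object pixels are taken to be 1-bits (an assumption for the sketch), and stitching runs that cross byte boundaries is left out:

```python
def byte_edges(b):
    """Inclusive (start, end) index pairs of 1-bit runs in byte b, MSB first."""
    bits = [(b >> (7 - i)) & 1 for i in range(8)]   # MSB-first, as np.packbits
    starts = [i for i in range(8) if bits[i] and (i == 0 or not bits[i - 1])]
    ends = [i for i in range(8) if bits[i] and (i == 7 or not bits[i + 1])]
    return tuple(zip(starts, ends))

# Precompute once; scanning a raster is then one table lookup per 8 pixels.
LUT = [byte_edges(b) for b in range(256)]
```

For example, the byte 0b11100110 maps to ((0, 2), (5, 6)): two runs, found without examining individual pixels at scan time.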
If these pixels come from webcam video frames, it might be useful to filter the current frame based on what was seen in the previous frame on the same raster, at least if there's some amount of horizontal object motion appearing in the feed. Then your lookup table is indexed on both prev and cur pixel values.
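One way to sketch that two-frame indexing, assuming 8 pixels per byte: combine the previous and current packed bytes into a 16-bit key. The filter rule below (OR the two frames to suppress single-frame dropouts) is purely an illustrative assumption:

```python
import numpy as np

def key(prev_byte, cur_byte):
    """Combine two packed bytes into a 16-bit table index."""
    return (prev_byte << 8) | cur_byte

# 65536-entry table; each entry stores the filtered byte for that pair.
TABLE = np.empty(65536, dtype=np.uint8)
for p in range(256):
    for c in range(256):
        TABLE[key(p, c)] = p | c    # assumed filter: keep pixels seen in either frame
```

The table is built once; per raster byte the filter is then a single indexed load.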