I have an array of 1000-2000 lists, and I need to pass each of those lists into a function. Each function call takes approximately 0.002s. I have already tried concurrent.futures.ProcessPoolExecutor and multiprocessing.Pool; each of those approaches is 1.5x-2x slower than simply iterating over the big array in a linear, single-process way. I've also tried splitting the big array of lists into several smaller arrays of lists, but that didn't get the job done any faster either, only slower. I assume the reason is the overhead of switching between processes, but theoretically, splitting the big array into chunks should have fixed that.
Is there any way to do this faster?
The lists are lines of an image (numpy arrays) with values 0 or 255 (black and white pixels).
The function detects the index ranges of runs of black pixels on a single line:
[0, 0, 0, 255, 255, 0, 255, 255,255] -> ((0, 2), (5, 5))
Code for testing (objects_on_line is the part that needs to be parallelized):
import numpy as np
from time import perf_counter
import cv2
OBJ_COLOR = 0
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield tuple(lst[i:i + n].tolist())

def objects_on_line(line):
    line_original = np.array(line)
    line = np.where(line_original == OBJ_COLOR)[0]
    mask = [0] * len(line)
    for i, el in enumerate(line[:-1]):
        if line[i+1] - el == 1 and (i+2 < len(line) and line[i+2] - el == 2):
            mask[i+1] = 1
    line = np.ma.array(line, mask=mask).compressed()
    i = 0
    for _ in line[:-1]:
        if i+1 >= len(line):
            break
        lv = line[i]
        ls = line_original[lv:line[i+1]]
        if len(np.where(ls != OBJ_COLOR)[0]) != 0:
            line = np.insert(line, i, lv)
        i += 2
    if len(line) % 2 != 0:
        line = np.insert(line, len(line), line[-1])
    line = list(chunks(line, 2))
    return line

# cv2.imread returns BGR, and cv2.threshold returns a (retval, image) tuple
img = cv2.cvtColor(cv2.imread("test_people.png"), cv2.COLOR_BGR2GRAY)
_, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

s = perf_counter()
for line in img_bin:
    objects_on_line(line)
print(f"Done in: {perf_counter()-s}")
    line_original = np.array(line)

Don't do that. Change the API you define, so there is an objects_on_lines(image: np.array, lines: tuple[int]) function. You want to put all the image pixels into a single numpy array and then leave them there. Pass in a (start, end) range of indexes for a core to chew on. (Heck, you could even pass in a range object.)
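A minimal sketch of what that API could look like, assuming the binary image is a 2-D uint8 array already in memory and each worker receives a (start, end) row range. The run detection here is done with vectorized diff/flatnonzero rather than the question's Python loops; the function name matches the suggestion above, everything else is illustrative:

```python
import numpy as np

OBJ_COLOR = 0

def objects_on_lines(image, start, end):
    """Find (first, last) index pairs of OBJ_COLOR runs in rows start..end."""
    results = []
    for row in image[start:end]:
        is_obj = (row == OBJ_COLOR)
        # Pad with False on both sides so runs touching the borders are caught.
        padded = np.concatenate(([False], is_obj, [False]))
        # Nonzero diffs mark transitions; they come in (rise, fall) pairs.
        edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
        starts = edges[0::2]
        ends = edges[1::2] - 1          # inclusive end index of each run
        results.append(tuple(zip(starts.tolist(), ends.tolist())))
    return results
```

On the question's example row this reproduces the expected ((0, 2), (5, 5)) output, and no per-row copies of the pixel data are made.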
Avoid copying into a newly created list. Just use the pixels where they are. Point at them with indexes.
It's not clear that your system will be able to achieve a 12x speedup, even if coded in Rust or C++. It might be the case that a handful of cores can max out your main memory bandwidth. Increasing the multiprogramming level beyond that, to 12 or 24, would be counterproductive.
Take care to represent pixels compactly. At least use uint8. If possible use a bit vector, for an 8x reduction in memory bandwidth requirements.
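A quick sketch of the bit-vector idea with numpy's packbits, assuming pixels are 0 or 255 in a uint8 array as in the question; one packed byte then carries 8 pixels:

```python
import numpy as np

# Toy one-row image, same values as the question's example.
img_demo = np.array([[0, 0, 0, 255, 255, 0, 255, 255, 255]], dtype=np.uint8)

bits = (img_demo != 0)                # True where the pixel is white
packed = np.packbits(bits, axis=1)    # 8 pixels per byte, MSB first
# Round-trip back to individual pixels; trim the padding bits.
unpacked = np.unpackbits(packed, axis=1)[:, :img_demo.shape[1]]
```

The 9-pixel row packs into 2 bytes, so scanning the packed form touches roughly an eighth of the memory.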
Consider staggering your offsets within the array. If all threads are hitting virtual memory addresses which conflict for the same cache line, then with 4-way associativity there's a fairly low limit on how many threads can productively contend for fresh pixels.
There's something called "the straggler effect". Your N identical cores won't necessarily be scheduled and perform identically. It's likely that at least one will complete its task after the others. So rather than breaking 2000 lists into N same-size lists, consider breaking out N "large" plus a couple of "small" lists. Begin working on the large lists initially. Then if a core is done with a large list early, it will find a small list to work on, overlapping with the work done by some straggler core.
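A sketch of that straggler-friendly work split: N large chunks sized for the core count, plus a tail of small chunks for early finishers to pick up. The function name and the size ratios are illustrative assumptions, not tuned values:

```python
def make_chunks(n_rows, n_cores, small_frac=0.2):
    """Return (start, end) row ranges: n_cores large chunks, then small ones."""
    big_rows = int(n_rows * (1 - small_frac))
    big = big_rows // n_cores            # size of each large chunk
    small = max(1, big // 4)             # size of each small tail chunk
    bounds, start = [], 0
    for _ in range(n_cores):
        bounds.append((start, start + big))
        start += big
    while start < n_rows:                # leftover rows in small pieces
        bounds.append((start, min(start + small, n_rows)))
        start += small
    return bounds
```

Submitting these ranges to a pool in order means the large chunks start first, and any core that finishes early grabs a small chunk while a straggler is still working.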
Table lookups may help to speed your edge detection. Suppose each raster of bits is encoded in uint8 bytes, 2 bits at a time. Now there are four cases:

- 00, no object
- 01, found the left edge
- 10, found the right edge
- 11, all object

Combine a bunch of 2-bit results to find the result for a whole raster.
More usefully, you might store more than 2 bits per array element. Four bits leads to a 16-entry lookup table. You might want to filter salt & pepper noise from entries, e.g. viewing 0101 (half object?) as 0111 (mostly object!). A more efficient approach of 8 pixels per byte leads to a 256-entry lookup table.
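A sketch of that 256-entry table: for each possible packed byte, precompute the within-byte (start, end) positions of object runs. Here object pixels are taken to be 1-bits (an assumption for the sketch), and stitching runs that cross byte boundaries is left out:

```python
def byte_edges(b):
    """Inclusive (start, end) index pairs of 1-bit runs in byte b, MSB first."""
    bits = [(b >> (7 - i)) & 1 for i in range(8)]   # MSB-first, as np.packbits
    starts = [i for i in range(8) if bits[i] and (i == 0 or not bits[i - 1])]
    ends = [i for i in range(8) if bits[i] and (i == 7 or not bits[i + 1])]
    return tuple(zip(starts, ends))

# Precompute once; scanning a raster is then one table lookup per 8 pixels.
LUT = [byte_edges(b) for b in range(256)]
```

For example, the byte 0b11100110 maps to ((0, 2), (5, 6)): two runs, found without examining individual pixels at scan time.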
If these pixels come from webcam video frames, it might be useful to filter the current frame based on what was seen in the previous frame on the same raster, at least if there's some amount of horizontal object motion appearing in the feed. Then your lookup table is indexed on both prev and cur pixel values.
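One way to sketch that two-frame indexing, assuming 8 pixels per byte: combine the previous and current packed bytes into a 16-bit key. The filter rule below (OR the two frames to suppress single-frame dropouts) is purely an illustrative assumption:

```python
import numpy as np

def key(prev_byte, cur_byte):
    """Combine two packed bytes into a 16-bit table index."""
    return (prev_byte << 8) | cur_byte

# 65536-entry table; each entry stores the filtered byte for that pair.
TABLE = np.empty(65536, dtype=np.uint8)
for p in range(256):
    for c in range(256):
        TABLE[key(p, c)] = p | c    # assumed filter: keep pixels seen in either frame
```

The table is built once; per raster byte the filter is then a single indexed load.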