Search code examples
pythonfilterparallel-processingmultiprocessingpython-multithreading

Python filter array with multiprocessing


Right now I'm filtering an array using

arr = [a for a in tqdm(replays) if check(a)]

However with hundreds of thousands of elements, this takes alot of time. I was wondering if it was possible to do this with multiprocessing, ideally in a nice and compact pythonic way.

Thanks!


Solution

  • Define a multiprocessing-using parallel filter function pfilter:

    from multiprocessing import Pool
    
    def pfilter(filter_func, arr, cores):
        with Pool(cores) as p:
            booleans = p.map(filter_func, arr)
            return [x for x, b in zip(arr, booleans) if b]
    

    async means that order of execution is truly independent from each other between the elements.

    The usage in your case is (4 cpus):

    arr = pfilter(check, tqdm(replays), 4)
    

    For some weird reasons however, the filter_func isn't allowed to be a lambda expression or defined as one ...