Search code examples
pythonmultithreadingjoblib

Using joblib is slower than single thread loop


from tqdm import tqdm
from joblib import Parallel, delayed
import numpy as np
import time

def func(x,a,b,c):
    """x3+ax2+bx+c"""
    return x**3+a*x**2+b*x+c

def func_prime(x,a,b):
    """3x2+2ax+b"""
    return 3*x**2+2*a*x+b

def check(i):
    a,b,c = i
    if func_prime(-1/4, a,b) != -1/4:
        return False
    if func_prime(1/4,a,b) >=0:
        return False
    for i in range(-300,300):
        if func(i-1, a,b,c)*func(i+1,a,b,c)<0:
            return False
    return a,b,c

def gen():
    a=np.arange(-1, 1, 1/128)
    b=np.arange(-1, 1, 1/128)
    c=np.arange(-1, 1, 1/128)
    for i_a in a:
        for i_b in b:
            for i_c in c:
                yield i_a, i_b, i_c

# single thread. it spend 11 seconds. 
result = set()
for i in tqdm(gen()):
    result.update([check(i)])

# using joblib. it spend 313 seconds. 
start_time = time.time()
_ = Parallel(n_jobs=10)(delayed(check)(i) for i in gen())
end_time = time.time()
print(end_time - start_time)

I'm tring to solve a math quiz for fun, and solution is "fx=x3-0.375x2-0.625x + 0".

My code test all combinations of values, so I thought parallel programming make it faster. But this is not so, it is much slower than with a single thread.

My questions:

  1. why multithread is slower?
  2. how can I fix it faster than single thread?
  3. early stopping. if I find a solution for it, quit asap. how can I implement that?

Solution

  • tl;dr: You're iterating in the wrong place.


    gen() produces many values (triples), which are serialized and sent down pipes to worker sub-processes.

    def check(i):
        a, b, c = i
        ...
        for i in range(-300, 300):
    

    First, that second i is the wrong identifier, it should be x. Starting out with a tuple and then switching to a scalar is just bizarre, avoid doing that.

    Second, the for loop may exit early. Now, if we typically do six hundred iterations, this could possibly be a sensible parent-child work balance. But I get the sense that we're spending a lot of time serializing those 3-tuples, sending them down a pipe, deserializing them, and quickly doing an early exit long before hitting six hundred.

    You have nested a, b, c loops and ten cores. I recommend that the parent only loops over a values, and lets the child worry about the b, c iteration. That way you spend a greater fraction of CPU cycles on evaluating func than on serializing / deserializing.


    Separate issue:

        if func_prime(-0.25, a, b) != -0.25:
    

    It's not usually recommended to test a float for exact equality (or in this case, inequality). When comparing x == y, prefer to compute whether abs(x - y) is less than some small epsilon (or in this case, greater than ϵ ).

    We have a special case, here, reciprocal of a power of 2, so IEEE-754 can exactly represent the quantity of interest. But in general, you should develop some healthy level of paranoia about FP rounding issues.