python multiprocessing hpc pbs

Multiprocessing on PBS cluster node


I have to run multiple simulations of the same model with varying parameters (or random number generator seeds). Previously I worked on a server with many cores, where I used Python's multiprocessing library with apply_async. This was very handy, as I could set the maximum number of cores to occupy and the remaining simulations would simply wait in a queue.

As I understand from other questions, multiprocessing works on PBS clusters as long as you stay on a single node, which is fine for now. However, my code doesn't always work.

Here is a sketch of my code:

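import multiprocessing as mp

import numpy as np
import pandas as pd

import functions_library as L

if __name__ == "__main__":

    N = 100     # number of simulations
    proc = 50   # maximum number of worker processes

    pool = mp.Pool(processes=proc)

    # Draw one seed per simulation from a fixed master seed.
    seed = 342
    np.random.seed(seed)
    seeds = np.random.randint(low=1, high=100000, size=N)

    resul = []
    for SEED in seeds:
        SEED = int(SEED)
        # some_args is a placeholder for the real arguments.
        resul.append(pool.apply_async(L.some_function, args=(some_args)))
        print(SEED)

    # Block until every simulation has finished.
    results = [p.get() for p in resul]

    database = pd.DataFrame(results)
    database.to_csv("prova.csv")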

The function creates three networkx graphs of size N=10000, performs some computations on them, and then returns a short Python dictionary.

The weird thing I cannot debug is the following error message:

multiprocessing.pool.MaybeEncodingError: Error sending result: ''. Reason: 'RecursionError('maximum recursion depth exceeded while calling a Python object')'

What's strange is that I ran multiple instances of the code on different nodes. Three times the code worked correctly, whereas most of the time it raised the error above. I tried launching different numbers of parallel simulations, from 7 to 20 (the number of cores on the nodes), but there doesn't seem to be a pattern, so I guess it's not a memory issue.

In other questions, similar errors seem to be related to pickling strange or big objects, but in this case the only thing that comes out of the function is a short dictionary, so it shouldn't be related to that. I also tried increasing the allowed recursion depth with the sys library at the beginning of the script, but that didn't help even with values up to 15000.
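One way to test the assumption that the return value is harmless is to pickle it by hand, since that is roughly what a pool worker does before sending a result back. The sketch below is only illustrative: pickled_size is a hypothetical helper and sample_result is a made-up stand-in for the real dictionary.

```python
import pickle


def pickled_size(obj):
    """Return how many bytes pickle needs for obj; raises if obj cannot be pickled."""
    return len(pickle.dumps(obj))


# Hypothetical return value, standing in for the short dictionary described above.
sample_result = {"seed": 342, "mean_degree": 4.2, "n_components": 3}
print(pickled_size(sample_result))
```

If this succeeds for a typical return value, the dictionary itself is unlikely to be what fails to pickle.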

Any idea to solve or at least understand this behavior?


Solution

  • It was related to eigenvector_centrality() not converging. When run outside multiprocessing, it correctly raises a networkx error, whereas inside a pool worker only this recursion error is reported.

    I do not know whether this behavior is specific to that function or whether multiprocessing sometimes cannot handle exceptions raised by some libraries.
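The "Error sending result" wording suggests the failure happens while the worker pickles what it sends back, which on a failed task is the exception rather than the dictionary. A possible workaround, sketched below under that assumption, is to catch exceptions inside the worker and return only plain text. The names safe_call, flaky_simulation and run_all are hypothetical, and flaky_simulation merely stands in for a simulation that sometimes fails to converge.

```python
import multiprocessing as mp
import traceback


def safe_call(func, *args, **kwargs):
    """Run func and always return a plain, picklable dict."""
    try:
        value = func(*args, **kwargs)
        return {"ok": True, "value": value, "error": None, "traceback": None}
    except Exception as exc:
        # Send text, not the exception object, back to the parent process.
        return {
            "ok": False,
            "value": None,
            "error": f"{type(exc).__name__}: {exc}",
            "traceback": traceback.format_exc(),
        }


def flaky_simulation(seed):
    """Hypothetical stand-in for a simulation that fails for some seeds."""
    if seed % 2:
        raise ValueError(f"did not converge for seed {seed}")
    return {"seed": seed, "score": seed * 0.5}


def run_all(seeds, processes=2):
    """Submit one task per seed and collect the wrapped outcomes."""
    with mp.Pool(processes=processes) as pool:
        pending = [
            pool.apply_async(safe_call, args=(flaky_simulation, s)) for s in seeds
        ]
        return [p.get() for p in pending]
```

Calling run_all([1, 2, 3, 4]) under an if __name__ == "__main__": guard returns one dict per seed, so the failed runs can be filtered on the "ok" key and the original error text read from "error" instead of being masked by the recursion error.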