Tags: python, multiprocessing, pickle, python-multiprocessing

Slow pickle dump in Python when using multiprocessing


So, I'm trying to parallelize a function that solves Pyomo instances with Python 3.7, using the multiprocessing module. The code works, but the startup time is absurd (~25 seconds per process). The weird thing is, I tried the same code on another, far less powerful computer and it went down to ~2 seconds (same code, same number of parallel processes, same versions of everything except Python, which is 3.6 on that PC).

Using cProfile, I found that the dump method of the pickler was consuming almost all of that time, but I can't see why it would take so long. The data is small, and I used sys.getsizeof() to check whether any of the arguments of the parallelized function were larger than expected, but they were not.
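
For reference, sys.getsizeof() only reports the shallow size of the outer object, not everything it references, so a rough sanity check is to measure the pickled bytes directly. A minimal sketch (the pickled_size helper and the example object are just illustrative, not part of my code):

import pickle
import sys

def pickled_size(*args):
    # Number of bytes pickle actually has to write for these arguments,
    # i.e. roughly what Process.start() has to serialize on Windows (spawn)
    return len(pickle.dumps(args, -1))

example = {"matrix": [[1, 2, 3] * 10] * 30}   # stand-in object
print(sys.getsizeof(example))   # shallow size of the dict container only
print(pickled_size(example))    # full serialized payload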

Does anyone know what could be the cause of the slow pickle dump?

The code:

from pyomo.environ import *
from pyomo.opt import SolverFactory, TerminationCondition
from pyomo.opt.parallel import SolverManagerFactory
import sys
import cProfile
import pstats
import multiprocessing

def worker(instance, data, optsolver, queue, shared_incumbent_data):
    #[pyomo instances solving and constraining]
    return

def foo(model, data, optsolver, processes = multiprocessing.cpu_count()):

    queue = multiprocessing.Queue()
    process_dict = {}

    for i_node in range(len(init_nodes)): #init_nodes is a list containing lists of pyomo instances
        for j_node in range(len(init_nodes[i_node])):
            
            process_name = str(i_node) + str(j_node)
            print(" - Data size:", sys.getsizeof(data)) #same for all of the args
            
            process_dict[process_name] = multiprocessing.Process(target=worker, args=(init_nodes[i_node][j_node], data, optsolver, queue, shared_incumbent_data))

            pr = cProfile.Profile()
            pr.enable()                 

            process_dict[process_name].start()

            pr.disable()
            ps = pstats.Stats(pr)
            ps.sort_stats('time').print_stats(5)

    for process_name in process_dict:
        process_dict[process_name].join(timeout=0)

#imports        
#[model definition]
#[data is obtained from 3 .tab files, the biggest one has a 30 x 40 matrix, with 1 to 3 digit integers]     
optsolver = SolverFactory("gurobi")

if __name__ == "__main__":
    foo(model, data, optsolver, 4)

Size of the arguments obtained by sys.getsizeof() and profile of the .start() call on the first computer:

 - Data size: 56
 - Init_nodes size: 72
 - Queue size: 56
 - Shared incumbent data size: 56

         7150 function calls (7139 primitive calls) in 25.275 seconds

   Ordered by: internal time
   List reduced from 184 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   25.262   12.631   25.267   12.634 {method 'dump' of '_pickle.Pickler' objects}
        1    0.004    0.004    0.004    0.004 {built-in method _winapi.CreateProcess}
     1265    0.002    0.000    0.004    0.000 C:\Users\OLab\AppData\Local\Continuum\anaconda3\lib\site-packages\pyomo\core\expr\numeric_expr.py:186(__getstate__)
        2    0.001    0.001    0.002    0.001 <frozen importlib._bootstrap_external>:914(get_data)
     1338    0.001    0.000    0.002    0.000 C:\Users\OLab\AppData\Local\Continuum\anaconda3\lib\site-packages\pyomo\core\expr\numvalue.py:545(__getstate__)

Size of the arguments obtained by sys.getsizeof() and profile of the .start() call on the second computer:

 - Data size: 56
 - Init_nodes size: 72
 - Queue size: 56
 - Shared incumbent data size: 56

         7257 function calls (7247 primitive calls) in 1.742 seconds

   Ordered by: internal time
   List reduced from 184 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    1.722    0.861    1.730    0.865 {method 'dump' of '_pickle.Pickler' objects}
        1    0.009    0.009    0.009    0.009 {built-in method _winapi.CreateProcess}
     1265    0.002    0.000    0.005    0.000 C:\Users\Palbo\Anaconda2\envs\py3\lib\site-packages\pyomo\core\expr\numeric_expr.py:186(__getstate__)
     1339    0.002    0.000    0.003    0.000 C:\Users\Palbo\Anaconda2\envs\py3\lib\site-packages\pyomo\core\expr\numvalue.py:545(__getstate__)
     1523    0.001    0.000    0.001    0.000 {built-in method builtins.hasattr}

Cheers!

The specs of the first computer that should be way faster but isn't:

  • Windows 10 Pro for Workstations
  • Intel Xeon Silver 4114 CPU @ 2.20 GHz (10 cores each)
  • 64 GB RAM

Second computer specs:

  • Windows 8.1
  • Intel Core i3-2348M CPU @ 2.30 GHz (2 cores each)
  • 6 GB RAM

Solution

  • Finally found a solution: pickle the arguments of the function to a file, pass the file name as an argument to the worker() function, and load the file from within the function in each parallel process.

    Dump time went down from ~24 s to ~0.005 s!

    import pickle

    def worker(pickled_file_name, queue, shared_incumbent):
        # Load the pickled arguments from disk inside the child process
        with open(pickled_file_name, "rb") as f:
            data_tuple = pickle.load(f, encoding='bytes')
        instance, data, optsolver, int_var_list, process_name, relaxed_incumbent = data_tuple
        return

    def foo():
        [...]
        # Write the heavy arguments to a file once, instead of passing them
        # through the Process() arguments (which get pickled on every start())
        pickled_file_name = "pickled_vars" + str(i_node) + str(j_node) + ".p"
        with open(pickled_file_name, "wb") as picklefile:
            picklefile.write(pickle.dumps(variables_, -1))

        process_dict[process_name] = multiprocessing.Process(target=worker, args=(pickled_file_name, queue, shared_incumbent_data))
        process_dict[process_name].start()
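
  • A generic, self-contained sketch of the same file-based handoff (no Pyomo; all names here are illustrative, not the actual code above), in case it helps reproduce the pattern:

    import multiprocessing
    import os
    import pickle

    def worker(pickled_file_name, queue):
        # Child process: load the heavy arguments from disk instead of receiving
        # them through the pickled Process() arguments
        with open(pickled_file_name, "rb") as f:
            heavy_args = pickle.load(f)
        queue.put(sum(heavy_args["matrix"][0]))  # stand-in for the real work

    if __name__ == "__main__":
        heavy_args = {"matrix": [[i for i in range(40)] for _ in range(30)]}

        file_name = "pickled_vars_00.p"
        with open(file_name, "wb") as f:
            pickle.dump(heavy_args, f, -1)

        queue = multiprocessing.Queue()
        # Only the short file name string is pickled by start(), so spawning stays cheap
        p = multiprocessing.Process(target=worker, args=(file_name, queue))
        p.start()
        print(queue.get())   # 0 + 1 + ... + 39 = 780
        p.join()
        os.remove(file_name)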