When a function run through Joblib reads a global variable, the workers on Linux appear to reach that variable without making any copy of it.
We can test this in the following script:
import joblib
import numpy as np
print("Initializing global")
# Let's create a global that is big, so it takes time to create it
my_global = np.random.uniform(0,100, size=(10**4, 10**4))
print("done")
# A simple function working on the global variable
def fun_with_global():
    return id(my_global)
print("starting // loop")
joblib.Parallel(n_jobs=3, backend="multiprocessing", verbose=100)((joblib.delayed(fun_with_global)() for i in range(1000)))
joblib.Parallel(n_jobs=3, backend="loky", verbose=100)((joblib.delayed(fun_with_global)() for i in range(1000)))
# Both Parallel calls above finish almost instantly, even with 1000 jobs.
# If fun_with_global returns id(my_global.copy()) instead, the copy makes each call noticeably slow.
The fact that the calls to Parallel are almost instant means there is no pickling/unpickling of the global variable.
How surprising this is depends on the backend:
With the multiprocessing backend, this is totally logical, since the multiprocessing workers fork the original process, meaning the global variable my_global is already present in the worker memory without any effort.
With the loky backend, it's stated in the documentation that loky workers fork/exec, meaning they don't have easy access to this global variable.
So, how does Loky get access to global variables from the parent process without creating a copy or forking?
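For reference, the fork behavior described above can be reproduced with the standard library alone. This is a minimal sketch, not what Joblib does internally: the names `shared_state`, `child` and `read_global_in_forked_child` are made up for illustration, and it assumes a platform where the `fork` start method exists (Linux, not Windows). The global is filled in at runtime, so a child that only re-imported the module would see an empty list.

```python
import multiprocessing as mp

# Hypothetical global; it is empty at import time and filled in at runtime.
shared_state = []


def child(conn):
    # In a forked child, shared_state already holds whatever the parent
    # had in memory at fork time; nothing was pickled to bring it here.
    conn.send(sum(shared_state))
    conn.close()


def read_global_in_forked_child():
    # Populate the global in the parent, after import and before forking.
    shared_state[:] = range(5)
    ctx = mp.get_context("fork")  # assumption: fork is available (not on Windows)
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=child, args=(child_conn,))
    proc.start()
    result = parent_conn.recv()
    proc.join()
    return result


if __name__ == "__main__":
    print(read_global_in_forked_child())
```

A worker started by fork/exec does not inherit the parent's memory this way, which is why the loky case needs a different explanation.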
EDIT: The example above only behaves this way when the global variable is based on a numpy array. With another kind of variable, the behavior is different:
import joblib
import numpy as np
print("Initializing global")
# This time, let's create a big variable that is not based on np arrays
with open("/dev/urandom", "rb") as fd:
    my_global = fd.read(10**9)
print("done")
def fun_with_global():
    return id(my_global)
print("starting // loop")
joblib.Parallel(n_jobs=3, backend="multiprocessing", verbose=100)((joblib.delayed(fun_with_global)() for i in range(1000)))
joblib.Parallel(n_jobs=3, backend="loky", verbose=100)((joblib.delayed(fun_with_global)() for i in range(1000)))
# Here the multiprocessing backend still executes instantly, but the loky backend is slow
Actually, Loky also serializes the global variables used by the function, in addition to the function itself, as we see here. This means that every global variable the function uses is sent to the workers through the serialization machinery.
Additionally, the near-instant calls observed with the Loky backend in the first example only occur with numpy arrays. For numpy arrays, Loky smartly memmaps the arrays, which makes transferring them to the other processes almost instant. For variables other than numpy arrays, Loky has to serialize and deserialize the whole variable (not just a memmap handle), which takes far more time, as seen in the second example above.
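The memmapping can be observed directly through the documented path where an array is passed as an argument. This is a hedged sketch: it shows Joblib's automatic memmapping of large array arguments (controlled by the `max_nbytes` parameter of `Parallel`), and it is an assumption that a global array captured by the function goes through the same mechanism. The helper names `is_memmap` and `demo` are made up for illustration.

```python
import numpy as np
from joblib import Parallel, delayed


def is_memmap(arr):
    # Runs in a worker: report whether the array arrived as a memory map.
    return bool(isinstance(arr, np.memmap))


def demo():
    small = np.ones(100)    # 800 bytes, below the 1 MB threshold
    large = np.ones(10**6)  # 8 MB, above the 1 MB threshold
    # max_nbytes is the size above which Joblib dumps an array to a
    # temporary file and sends the workers a memmap handle instead.
    return Parallel(n_jobs=2, backend="loky", max_nbytes="1M")(
        delayed(is_memmap)(arr) for arr in (small, large)
    )


if __name__ == "__main__":
    print(demo())
```

Only the large array is expected to arrive as a memmap. A large bytes object, like the one in the second example, has no such shortcut and is pickled in full.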