
Why does recurse=True cause dill not to respect globals in functions?


If I pickle a function with dill that refers to a global, that global state isn't respected when the function is loaded again. I don't understand enough about dill to be any more specific, but take this working code for example:

import multiprocessing
import dill

def initializer():
    global foo
    foo = 1

def worker(arg):
    return foo
   
with multiprocessing.Pool(2, initializer) as pool:
    res = pool.map(worker, range(2))

print(res)

This works fine, and prints [1, 1] as expected. However, if I instead pickle the initializer and worker functions using dill's recurse=True, and then restore them, it fails:

import multiprocessing
import dill

def initializer():
    global foo
    foo = 1

def worker(arg):
    return foo

with open('funcs.pkl', 'wb') as f:
    dill.dump((initializer, worker), f, recurse=True)

with open('funcs.pkl', 'rb') as f:
    initializer, worker = dill.load(f)

with multiprocessing.Pool(2, initializer) as pool:
    res = pool.map(worker, range(2))

This code fails with the following error:

  File "/tmp/ipykernel_158597/1183951641.py", line 9, in worker
    return foo
           ^^^
NameError: name 'foo' is not defined

If I use recurse=False it works fine, but somehow pickling them in this way causes the code to break. Why?


Solution

  • With the recurse=True option, dill.dump builds a new globals dict for each function being serialized, containing the objects that the function refers to, which are in turn serialized recursively. The side effect is that when the pickle is deserialized with dill.load, all of these are reconstructed as new objects, including each function's globals dict.

    As a result, after deserialization the two functions no longer share one globals dict. The statement foo = 1 in initializer writes to initializer's own globals dict, while worker looks up foo in its own separate dict, finds nothing there, and raises NameError.

    You can verify this behavior by comparing the identities of the global namespaces the functions run under, available as the __globals__ attribute of each function object:

    import dill
    
    def initializer():
        global foo
        foo = 1
    
    def worker(arg):
        return foo
    
    print(id(globals()))
    print(id(initializer.__globals__))
    print(id(worker.__globals__))
    
    with open('funcs.pkl', 'wb') as f:
        dill.dump((initializer, worker), f, recurse=True)
    
    with open('funcs.pkl', 'rb') as f:
        initializer, worker = dill.load(f)
    
    print('-- dilled --')
    print(id(globals()))
    print(id(initializer.__globals__))
    print(id(worker.__globals__))
    

    This outputs something like:

    124817730351552
    124817730351552
    124817730351552
    -- dilled --
    124817730351552
    124817727897280
    124817728060352
    

    Demo: https://replit.com/@blhsing1/RelievedPrimeLaws
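
The effect described above can also be reproduced without dill. The following is a minimal sketch using only the standard library: the hypothetical rebind helper (not part of dill) copies a function onto a chosen globals dict via types.FunctionType, which imitates what happens when each function comes back with its own namespace. It ignores default arguments and closures, so treat it as an illustration rather than a general-purpose tool.

```python
import types

def initializer():
    global foo
    foo = 1

def worker(arg):
    return foo

def rebind(func, namespace):
    """Hypothetical helper: copy func so that its __globals__ is namespace."""
    return types.FunctionType(func.__code__, namespace, func.__name__)

# Case 1: each function gets its own globals dict, as after loading a
# pickle that was written with recurse=True.
init_a = rebind(initializer, {})
work_a = rebind(worker, {})
init_a()                      # stores foo in init_a's dict only
try:
    work_a(0)
    separate_result = "ok"
except NameError:
    separate_result = "NameError"

# Case 2: both functions share one globals dict, as in the original module.
shared = {}
init_b = rebind(initializer, shared)
work_b = rebind(worker, shared)
init_b()                      # stores foo in the shared dict
shared_result = work_b(0)     # worker finds foo there

print(separate_result, shared_result)
```

In the first case worker fails exactly as in the question, because the assignment made by initializer lands in a dict that worker never consults. In the second case the lookup succeeds. The same idea could presumably be applied to functions returned by dill.load, by rebinding them onto one common dict, although that is an assumption that has not been tested here.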