
Python/Dill serialization hash depending on imported packages?


Consider the following code:

from os.path import join
import dill
from tempfile import TemporaryDirectory
import hashlib

def filehash(path):
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def func(a,b):
    return a + b
    
with TemporaryDirectory() as td:
    temp = join(td, "func.tmp")
    with open(temp, "wb") as f:
        dill.dump(func, f)
    print(filehash(temp))

This serializes a simple function func() to disk and then prints the hash of the resulting file.

Now add an import of some package that is never used, e.g. import numpy, above the first line and run the whole program again. The file hash is now different.

Could somebody tell me why that is?


Solution

  • When dill pickles a function, it also has to save the scopes that the function can access. Adding the import changes the module scope, and because that scope is part of what gets saved, the serialized bytes (and therefore the hash) change too.

    If you don't want that, I recommend putting the functions you are going to dill in a module of their own, so that their module scope contains nothing they don't need to access.

    I would also recommend not relying on the same code always producing the same dill output.
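
The scope difference described above can be seen without dill at all. The following is a minimal sketch using only the standard library: it defines the same func in two fresh module namespaces, one with an unused import, and compares what each function can see through its module scope. The helper make_func and the module name demo_module are hypothetical names made up for this illustration, and json stands in for numpy so that nothing needs to be installed. It does not show what dill actually writes to disk, only that the module scope a serializer could capture is different.

```python
import types


def make_func(extra_import=None):
    """Define func in a fresh module namespace, optionally with an unused import.

    make_func and "demo_module" are hypothetical names for this illustration.
    """
    source = "def func(a, b):\n    return a + b\n"
    if extra_import:
        source = f"import {extra_import}\n" + source
    module = types.ModuleType("demo_module")
    exec(source, module.__dict__)
    return module.func


plain = make_func()
with_import = make_func("json")  # json stands in for numpy here

# Names each function can reach through its module scope.
plain_names = set(plain.__globals__)
import_names = set(with_import.__globals__)

# The unused import is the only extra name in the module scope.
print(import_names - plain_names)  # {'json'}

# The compiled body of the function itself is unchanged.
print(plain.__code__.co_code == with_import.__code__.co_code)  # True
```

The function body is byte-for-byte the same in both cases; only the surrounding namespace differs. That is consistent with the advice to keep functions you intend to dill in a module that contains nothing else.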