I'm trying to use dills' dill.detect.trace(True)
functionality - which prints a trace during pickling - to find out why my unpickling process is so slow. I haven't found a way to output the trace for unpickling, but I am assuming this is more or less the 1-to-1 reverse process of pickling, and that inspecting the pickling trace should thus reveal bottlenecks which also hold for unpickling.
I don't know how to interpret its output, however. For example:
T4: <class 'foo_package.base.model.BarModel'>
# T4
D2: <dict object at 0x7f7de3832800>
T4: <class 'pathlib.PosixPath'>
# T4
T4: <class 'foo_package.some_module.Bar'>
# T4
D2: <dict object at 0x7f7d206273c0>
T4: <class 'some_package.some_module.Bar.options.SessionOptions'>
# T4
D2: <dict object at 0x7f7d20384040>
T4: <enum 'ModelType'>
# T4
T4: <enum 'SomeOtherEnum'>
# T4
# D2
D2: <dict object at 0x7f7d206b9ac0>
# D2
T4: <class 'some_package.some_other_module.XyzzyObject'>
# T4
D2: <dict object at 0x7f7d206cb100>
###... and so on...
How should one interpret D2
and T4
? And what do the #
prefixes mean? Can hints be found on the existence of cyclical references/deeply nested structures which increase (de)serialization time?
Dills documentation on the trace
function merely states:
trace(boolean)
print a trace through the stack when pickling; useful for debugging
Further context: (lowering risk of XY problem)
I'm using dill to store a partial state of a python program's instances in order to save an analytical model.
(Un)pickling has become very slow over time, and I'm trying to identify the cause.
I have already delegated storage of objects with dedicated serialization methods (e.g. numpy/pyarrow/pandas objects) to other serialization methods using __getstate__
and __setstate__
for objects containing significant amounts of data. This helped a little, but it still takes 1 to 2 minutes to deserialize, which slows down the debugging process quite a lot.
I'm the dill author. This is copied from the README on GitHub and PyPI...
To aid in debugging pickling issues, use dill.detect which provides tools like pickle tracing::
>>> import dill.detect
>>> dill.detect.trace(True)
>>> f = dumps(squared)
F1: <function <lambda> at 0x108899e18>
F2: <function _create_function at 0x108db7488>
# F2
Co: <code object <lambda> at 0x10866a270, file "<stdin>", line 1>
F2: <function _create_code at 0x108db7510>
# F2
# Co
D1: <dict object at 0x10862b3f0>
# D1
D2: <dict object at 0x108e42ee8>
# D2
# F1
>>> dill.detect.trace(False)
With trace, we see how dill
stored the lambda (F1
) by first storing
_create_function
, the underlying code object (Co
) and _create_code
(which is used to handle code objects), then we handle the reference to
the global dict (D2
). A #
marks when the object is actually stored.