I have a fairly complex Python program (more than 5,000 lines of code) written in Python 3.6. The program parses a dataset of more than 5,000 files, processes them into an internal representation of the dataset, and then computes statistics. Since I have to test the model, I need to save the dataset representation; at the moment I do this by serializing it with `dill` (the representation contains objects that `pickle` does not support). The serialization of the whole dataset, uncompressed, takes about 1 GB.
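For reference, the save/load step is essentially the following (a minimal sketch; `dataset`, the helper names, and the file name are placeholders, not my actual code):

```python
import dill

# Serialize the in-memory dataset representation to disk (~1 GB uncompressed).
def save_dataset(dataset, path="dataset.dill"):
    with open(path, "wb") as f:
        dill.dump(dataset, f)

# Load it back for testing without re-parsing the 5,000 source files.
def load_dataset(path="dataset.dill"):
    with open(path, "rb") as f:
        return dill.load(f)
```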
Now I would like to speed up the computation through parallelization. The ideal approach would be multithreading, but the GIL rules that out. The `multiprocessing` module (and `multiprocess`, which is `dill`-compatible) uses serialization to share complex objects between processes, so in the best scheme I could come up with, parallelization gains me essentially nothing in running time because of the huge size of the dataset.
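The pattern I tried is roughly the following (a sketch; `parse_file` and its toy return value stand in for my real per-file parser):

```python
from multiprocess import Pool  # dill-compatible fork of multiprocessing

# Hypothetical per-file parser: in my code this builds one piece of the
# internal representation; here it just returns a small dict as a stand-in.
def parse_file(path):
    return {"path": path, "stats": len(path)}

def build_dataset(file_paths):
    with Pool() as pool:
        # Every partial result is pickled in the worker and unpickled in the
        # parent, so with ~1 GB of total data the (de)serialization overhead
        # cancels out the time gained by parsing in parallel.
        return pool.map(parse_file, file_paths)

if __name__ == "__main__":
    print(build_dataset(["a.txt", "b.txt"]))
```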
What is the best way to manage this situation?
I know about `posh`, but it seems to be x86-only; `ray`, but it uses serialization too; `gilectomy` (a version of Python without the GIL), but I haven't been able to make it parallelize threads; and `Jython`, which has no GIL but is not compatible with Python 3.x.
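For instance, with `ray` the pattern looks roughly like this (a sketch with toy data; `compute_stats` and the chunks are hypothetical), and the chunks still get serialized into ray's object store before the workers can use them:

```python
import ray

ray.init()

@ray.remote
def compute_stats(chunk):
    # Hypothetical statistics step over one piece of the representation.
    return len(chunk)

# Task arguments are serialized into the object store before workers see them.
dataset_chunks = [list(range(1000)) for _ in range(8)]
refs = [compute_stats.remote(chunk) for chunk in dataset_chunks]
print(ray.get(refs))
```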
I am open to any alternative, in any language, however complex it may be, but I can't rewrite the code from scratch.
The best solution I found is to replace `dill` with a custom pickling procedure based on the standard `pickle` module. See here: Python 3.6 pickling custom procedure
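The general idea is to teach plain `pickle` how to handle the unsupported objects instead of falling back to `dill` for everything. A minimal sketch of that idea (not the exact code from the linked answer; the `Node` class and its reducer are hypothetical placeholders):

```python
import copyreg
import pickle
import threading

# Stand-in for one of my objects that plain pickle cannot handle
# (a threading.Lock, like an open file or a lambda, is not picklable).
class Node:
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

def reduce_node(obj):
    # Tell pickle how to rebuild the object: a callable plus its arguments.
    # The lock is deliberately dropped and recreated by __init__ on load.
    return (Node, (obj.value,))

# Register the custom reduction so standard pickle accepts Node instances.
copyreg.pickle(Node, reduce_node)

restored = pickle.loads(pickle.dumps(Node(42)))
print(restored.value)  # 42
```

The same registrations should also apply when objects cross process boundaries, since `multiprocessing` pickles with the standard machinery.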