I am working on a project whose objective is to separate the training and testing stages of machine learning projects. I designed the code to wrap the model being used (a classifier, for instance) in a class called Model:
class Model:
    def __init__(self, newModel):
        self.model = newModel   # the wrapped estimator, e.g. a classifier
        self.functions = {}     # registry of callables exposed by the model
Then I pass in the function objects that the model has to provide, using a list:
    def addFunctions(self, functions):
        for function in functions:
            self.functions[function.__name__] = function
Now the model can be used for classification, for instance, by constructing it with a classifier object and passing its functions in a list to addFunctions, so that I can invoke them by name. Then I package the model and the code in a Docker container, which is, simply put, a lightweight virtual machine.
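A minimal sketch of the intended usage (the scikit-learn classifier and the toy data are purely illustrative, not my actual setup; the Model class is the one defined above):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data, purely for illustration
X_train = np.random.rand(20, 3)
y_train = np.array([0, 1] * 10)

clf = LogisticRegression()
model = Model(clf)

# register the bound methods that should be callable by name later on
model.addFunctions([clf.fit, clf.predict])

# invoke the registered functions by name
model.functions["fit"](X_train, y_train)
predictions = model.functions["predict"](X_train)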
The purpose of the separation is to pass only the trained model to the Docker container after optimizing it, without having to ship the whole codebase. Hence the need to save/serialize the Python Model object.
I tried using pickle as well as jsonpickle, but both of them had limitations when serializing certain types of objects. I could not find any alternative that is generic enough for object storage and retrieval. Are there any alternatives?
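To illustrate the kind of limitation I mean (a minimal example, not my actual model): standard pickle refuses objects that reference things like lambdas, since it serializes functions by qualified name only.

import pickle

# a function defined as a lambda -- pickle cannot serialize it
double = lambda x: 2 * x

try:
    pickle.dumps(double)
except (pickle.PicklingError, AttributeError) as e:
    print("pickle failed:", e)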
Both dill and cloudpickle are very robust serializers, and can serialize almost any object in standard Python. (I'm the dill author, btw.)

dill is available as a standalone package at:
https://github.com/uqfoundation/dill/

while cloudpickle has pretty much died (it was supported by picloud, but they went commercial… and that has left pyspark and a few other packages supporting it inside their own codebases):
https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py

I use dill as the backbone of parallel and distributed computing in statistical computing and optimization, and have used it to enable parallel machine learning techniques. I haven't tried docker objects, however.
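A minimal sketch of round-tripping a wrapped model with dill (it assumes the Model instance and toy data from the question; the file name is arbitrary):

import dill

# serialize the wrapped model, including its registered bound methods
with open("model.pkl", "wb") as f:
    dill.dump(model, f)

# later, e.g. inside the docker container, restore it and use it
with open("model.pkl", "rb") as f:
    restored = dill.load(f)

predictions = restored.functions["predict"](X_train)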