I was playing around with the save and load functions of pyspark.ml.classification models. I created an instance of a RandomForestClassifier, set values for a couple of parameters, and called the save method of the classifier. It saves successfully. No issues there.
from pyspark.ml.classification import RandomForestClassifier
# save
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
Then I tried loading it back, but I noticed that its parameters don't have the values I had set before saving. Below is the code I tried:
# load
rf2 = RandomForestClassifier()
rf2.load('rf_test')
print(rf2.getImpurity()) # returns gini
print(rf2.getPredictionCol()) # returns prediction
I guess there's a difference between my understanding of how this code should work and how it actually works.
What should I do to get back the object the way I had saved it?
EDIT
I tried the approach mentioned here, but that didn't work. This is what I tried:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
rf2 = RandomForestClassifier
rf2.load('rf_test')
print(rf2.getImpurity())
which returned the following
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: getImpurity() missing 1 required positional argument: 'self'
That's not how you should use the load method. It is a classmethod and should be called on the class, not on an instance, and it returns a new object:
rf2 = RandomForestClassifier.load('rf_test')
rf2.getImpurity()
Technically speaking, calling it on an instance works as well, but it doesn't modify the caller; it returns a new, independent object:
rf2 = RandomForestClassifier().load('rf_test')
In practice, though, such a construct should be avoided in favor of the classmethod form.
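For completeness, here is a minimal round-trip sketch (assuming an active SparkSession and the same 'rf_test' path used above) showing that the parameter values come back when load is called on the class:
from pyspark.ml.classification import RandomForestClassifier

# save with non-default parameter values, as in the question
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')

# load on the class, not on an instance
rf2 = RandomForestClassifier.load('rf_test')
print(rf2.getImpurity())       # 'entropy'
print(rf2.getPredictionCol())  # 'predme'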