apache-spark, pyspark, apache-spark-ml

PySpark issues with loading unfit model object


I was experimenting with the save and load functions of pyspark.ml.classification models. I created an instance of RandomForestClassifier, set a couple of its parameters, and called the classifier's save method. It saves successfully, with no issues.

from pyspark.ml.classification import RandomForestClassifier
# save
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')

Then I tried loading it back, but the parameters did not have the values I had set before saving. This is the code I used:

# load
rf2 = RandomForestClassifier()
rf2.load('rf_test')
print(rf2.getImpurity()) # returns gini
print(rf2.getPredictionCol())  # returns prediction

I guess there's a difference in my understanding of how this code should work and how it actually works.

What should I do to get back the object the way I had saved it?

EDIT

I tried the approach mentioned here, but that didn't work. This is what I tried:

from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
rf2 = RandomForestClassifier
rf2.load('rf_test')
print(rf2.getImpurity())

which raised the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getImpurity() missing 1 required positional argument: 'self'

Solution

  • That's not how the load method is meant to be used. It is a classmethod: call it on the class, not on an instance, and use the new object it returns:

    rf2 = RandomForestClassifier.load('rf_test')
    rf2.getImpurity()
    

    Technically, calling it on an instance works as well, but it does not modify the caller. It returns a new, independent object, so you still have to capture the return value:

    rf2 = RandomForestClassifier().load('rf_test')
    

    In practice, though, such a construct should be avoided.
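
The classmethod behaviour described above can be reproduced without Spark. The sketch below is only an illustration: ToyEstimator, its impurity attribute and the JSON file format are hypothetical stand-ins, not part of pyspark, and the real RandomForestClassifier.load does considerably more work. It shows that load builds a new object and leaves the instance it was called on untouched.

```python
import json
import os
import tempfile


class ToyEstimator:
    """Hypothetical stand-in for a PySpark estimator; not part of pyspark."""

    def __init__(self):
        self.impurity = "gini"  # default value, mirroring the question

    def save(self, path):
        # Persist the single parameter as JSON (assumed format, for illustration).
        with open(path, "w") as f:
            json.dump({"impurity": self.impurity}, f)

    @classmethod
    def load(cls, path):
        # A classmethod receives the class, not the caller instance, so it
        # can only build and return a NEW object.
        obj = cls()
        with open(path) as f:
            obj.impurity = json.load(f)["impurity"]
        return obj


path = os.path.join(tempfile.mkdtemp(), "toy.json")

saved = ToyEstimator()
saved.impurity = "entropy"
saved.save(path)

caller = ToyEstimator()
returned = caller.load(path)      # works, but `caller` itself is not modified
loaded = ToyEstimator.load(path)  # preferred: call load on the class

print(caller.impurity)    # gini
print(returned.impurity)  # entropy
print(loaded.impurity)    # entropy
```

The same reasoning accounts for both symptoms in the question. In the first attempt the object returned by load was discarded, so rf2 kept its defaults. In the EDIT, rf2 was bound to the class itself rather than to an instance, so rf2.getImpurity() had no self to operate on.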