apache-spark pyspark apache-spark-ml

Failure loading a PySpark ML model


I have a couple of regression models that I cannot load. This is the Spark init:

from pyspark.sql import SparkSession, SQLContext
from pyspark.ml.regression import DecisionTreeRegressor

spark = SparkSession.builder \
    .appName("Linear Regression Model") \
    .config('spark.executor.cores','2') \
    .config("spark.executor.memory", "5gb") \
    .master("local[*]") \
    .getOrCreate() 

sc = spark.sparkContext

Here the model is fitted and saved successfully:

# Decision Tree Regression
decisionTree = DecisionTreeRegressor(featuresCol = "Features", labelCol = "SalePrice", maxDepth = 15, maxBins = 32)
decisionTreeModel = decisionTree.fit(train_vector)

import os

decisionTreeModel.save(os.path.join(".", 'decisionTreeModel'))

But when I load it back:

persistedModel = DecisionTreeRegressor.load("decisionTreeModel")

I get this error:

Py4JJavaError: An error occurred while calling o1201.load.
: java.lang.NoSuchMethodException: org.apache.spark.ml.regression.DecisionTreeRegressionModel.<init>(java.lang.String)
    at java.lang.Class.getConstructor0(Class.java:3082)
    at java.lang.Class.getConstructor(Class.java:1825)
    at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:468)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Does anybody know how to load a PySpark model?


Solution

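  • The error message is not very helpful, but I think the correct way to load the model back is to call the load method of the model class, not of the estimator. The estimator only holds the settings/parameters and is not fitted, whereas the model is what fit() returns and is already fitted to the data. The stack trace points the same way: the loader looks for a DecisionTreeRegressionModel constructor that takes a single string and does not find one, which suggests the saved directory holds a model rather than an estimator.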

    So you can try this:

    from pyspark.ml.regression import DecisionTreeRegressionModel
    
    persistedModel = DecisionTreeRegressionModel.load("decisionTreeModel")
    

    For your information, here's a comparison of loading the estimator vs loading the model:

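    from pyspark.ml.regression import DecisionTreeRegressor, DecisionTreeRegressionModel
    
    # Estimator: holds only the parameters, so it is loaded with DecisionTreeRegressor.load
    decisionTree = DecisionTreeRegressor(featuresCol = "Features", labelCol = "SalePrice", maxDepth = 15, maxBins = 32)
    decisionTree.save('tree')
    persistedEstimator = DecisionTreeRegressor.load('tree')
    
    # Fitted model: returned by fit(), so it is loaded with DecisionTreeRegressionModel.load
    # (train_vector is the training DataFrame from the question)
    decisionTreeModel = decisionTree.fit(train_vector)
    decisionTreeModel.save('model')
    persistedModel = DecisionTreeRegressionModel.load('model')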
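
If you are not sure whether a saved directory holds an estimator or a fitted model, one option is to look at the metadata Spark writes next to it. The sketch below is only an illustration: saved_class_name is a hypothetical helper (not part of PySpark), and it assumes the directory is on the local filesystem and follows the usual layout, where metadata/part-* contains one line of JSON with a "class" field.

```python
import json
from pathlib import Path


def saved_class_name(model_dir):
    """Return the class name recorded in a saved Spark ML directory.

    Hypothetical helper. Assumes a local path and the usual layout in
    which <model_dir>/metadata/part-* holds a single line of JSON.
    """
    metadata_dir = Path(model_dir) / "metadata"
    for part in sorted(metadata_dir.glob("part-*")):
        text = part.read_text().strip()
        if text:
            # The "class" field names the class that wrote the directory.
            return json.loads(text.splitlines()[0])["class"]
    raise FileNotFoundError(f"no metadata part file found under {metadata_dir}")
```

For the directory from the question you would call saved_class_name("decisionTreeModel"). A name ending in DecisionTreeRegressionModel means the directory should be loaded with DecisionTreeRegressionModel.load, while a name ending in DecisionTreeRegressor means it is an unfitted estimator.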