
PySpark trained Logistic Regression model has no predict() or predictProbability() method


I trained a Logistic Regression model with the PySpark MLlib built-in class LogisticRegression. After training, however, I can't use it to make predictions on other DataFrames: the calls fail with AttributeError: 'LogisticRegression' object has no attribute 'predictProbability' or AttributeError: 'LogisticRegression' object has no attribute 'predict'.

from pyspark.ml.classification import LogisticRegression
model = LogisticRegression(regParam=0.5, elasticNetParam=1.0)

# define the input feature & output column
model.setFeaturesCol('features')
model.setLabelCol('WinA')

model.fit(df_train)

model.setPredictionCol('WinA')
model.predictProbability(df_val['features'])
model.predict(df_val['features'])

AttributeError: 'LogisticRegression' object has no attribute 'predictProbability'

Properties:

PySpark version:

>>import pyspark
>>pyspark.__version__
3.1.2

JDK version:

>>!java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)

Environment: Google Colab


Solution

  • Your code here

    model.fit(df_train)
    

    did not actually give you a trained model in that variable: fit() returns the fitted model as a new object and leaves the estimator unchanged, so model is still an instance of the pyspark.ml.classification.LogisticRegression class

    type(model)
    
    # pyspark.ml.classification.LogisticRegression
    

    So you need to capture the object returned by fit(), either by assigning it to a new variable or by overwriting your model variable. That gives you the trained logistic regression model, an instance of the pyspark.ml.classification.LogisticRegressionModel class

    model = model.fit(df_train)
    type(model)
    
    # pyspark.ml.classification.LogisticRegressionModel
    

    Finally, the .predict and .predictProbability methods expect a single feature vector (for example a pyspark.ml.linalg.DenseVector object), not a DataFrame column such as df_val['features']. So I think you want to use .transform instead, since it returns the input DataFrame with the predicted label and probability added as columns. It would look like this

    predicted_df = model.transform(df_val)