pyspark, apache-spark-mllib

RandomForestClassifier has no attribute transform, so how to get predictions?


How do you get predictions out of a RandomForestClassifier? Loosely following the latest docs here, my code looks like...

# Split the data into training and test sets (25% held out for testing)
SPLIT_SEED = 64  # some const seed just for reproducibility
TRAIN_RATIO = 0.75
(trainingData, testData) = df.randomSplit([TRAIN_RATIO, 1-TRAIN_RATIO], seed=SPLIT_SEED)
print(f"Training set ({trainingData.count()}):")
trainingData.show(n=3)
print(f"Test set ({testData.count()}):")
testData.show(n=3)

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)

rf.fit(trainingData)
#print(rf.featureImportances)

preds = rf.transform(testData)

When running this, I get the error

AttributeError: 'RandomForestClassifier' object has no attribute 'transform'

Examining the Python API docs, I see nothing that looks like it relates to generating predictions from the trained model (nor feature importances, for that matter). I don't have much experience with MLlib, so I'm not sure what to make of this. Does anyone with more experience know what to do here?


Solution

  • Look closely at the example in the documentation:

    >>> model = rf.fit(td)
    >>> model.featureImportances
    SparseVector(1, {0: 1.0})
    >>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
    True
    >>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
    >>> result = model.transform(test0).head()
    >>> result.prediction
    

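    Notice that rf.fit returns a fitted model (a RandomForestClassificationModel), which is a different object from the original RandomForestClassifier estimator. The return value has to be captured; calling rf.fit(trainingData) and discarding the result leaves rf itself unchanged.

    The fitted model is what provides transform and featureImportances.

    So in your code: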

    # Train a RandomForest model.
    rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)
    
    model = rf.fit(trainingData)
    #print(model.featureImportances)
    
    preds = model.transform(testData)