I am trying to tune a random forest model using PySpark's CrossValidator and BinaryClassificationEvaluator, but when I do so I get an error. Here is my code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
# Create a spark RandomForestClassifier using all default parameters.
# Create training and testing DataFrames
training_df, testing_df = raw_data_df.randomSplit([0.6, 0.4])
# build a pipeline for analysis
va = VectorAssembler().setInputCols(training_df.columns[0:110]).setOutputCol('features')
# featuresCol="features"
rf = RandomForestClassifier(labelCol="quality")
# Train the model and calculate the AUC using a BinaryClassificationEvaluator
rf_pipeline = Pipeline(stages=[va, rf]).fit(training_df)
bce = BinaryClassificationEvaluator(labelCol="quality")
# Check AUC before tuning
bce.evaluate(rf_pipeline.transform(testing_df))
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
paramGrid = ParamGridBuilder().build()
crossValidator = CrossValidator(estimator=rf_pipeline,
estimatorParamMaps=paramGrid,
evaluator=bce,
numFolds=3)
model = crossValidator.fit(training_df)
It is throwing this error:
AttributeError: 'PipelineModel' object has no attribute 'fitMultiple'
How do I fix this issue?
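CrossValidator's estimator parameter takes a Pipeline object (an unfitted Estimator), not a PipelineModel. In your code, rf_pipeline is the result of calling .fit() on the Pipeline, so it is a PipelineModel, which has no fitMultiple method.
Please check this example for reference: https://github.com/apache/spark/blob/master/examples/src/main/python/ml/cross_validator.py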
rf_pipe = Pipeline(stages=[va, rf])
crossValidator = CrossValidator(estimator=rf_pipe,
estimatorParamMaps=paramGrid,
evaluator=bce,
numFolds=3)
Overall:
....
# Train the model and calculate the AUC using a BinaryClassificationEvaluator
rf_pipe = Pipeline(stages=[va, rf])
rf_pipeline = rf_pipe.fit(training_df)
...
crossValidator = CrossValidator(estimator=rf_pipe,
estimatorParamMaps=paramGrid,
evaluator=bce,
numFolds=3)
model = crossValidator.fit(training_df)