machine-learning pyspark pipeline cross-validation apache-spark-ml

Find the best pipeline model using CrossValidator and ParamGridBuilder

I have an acceptable model, but I would like to improve it by adjusting its parameters in Spark ML Pipeline with CrossValidator and ParamGridBuilder.

As an Estimator I will place the existing pipeline. In ParamMaps I would not know what to put, I do not understand it. As Evaluator I will use the RegressionEvaluator already created previously.

I'm going to do it for 5 folds, with a list of 10 different depth values in the tree.

How can I select and show the best model for the lowest RMSE?

ACTUAL example:

    from pyspark.ml import Pipeline
    from pyspark.ml.regression import DecisionTreeRegressor
    from pyspark.ml.feature import VectorIndexer
    from pyspark.ml.evaluation import RegressionEvaluator

    dt = DecisionTreeRegressor()
    dt.setPredictionCol("Predicted_PE")
    dt.setMaxBins(100)
    dt.setFeaturesCol("features")
    dt.setLabelCol("PE")
    dt.setMaxDepth(8)

    pipeline = Pipeline(stages=[vectorizer, dt])
    model = pipeline.fit(trainingSetDF)
    regEval = RegressionEvaluator(predictionCol = "Predicted_XX", labelCol = "XX", metricName = "rmse")
    rmse = regEval.evaluate(predictions)

    print("Root Mean Squared Error: %.2f" % rmse)
    (1) Spark Jobs 
    (2) Root Mean Squared Error: 3.60

NEED:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    dt2 = DecisionTreeRegressor()
    dt2.setPredictionCol("Predicted_PE")
    dt2.setMaxBins(100)
    dt2.setFeaturesCol("features")
    dt2.setLabelCol("PE")
    dt2.setMaxDepth(10)

    pipeline2 = Pipeline(stages=[vectorizer, dt2])
    model2 = pipeline2.fit(trainingSetDF)
    regEval2 = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")

    paramGrid = ParamGridBuilder().build() # ??????
    crossval = CrossValidator(estimator = pipeline2, estimatorParamMaps = paramGrid, evaluator=regEval2, numFolds = 5) # ?????

    rmse2 = regEval2.evaluate(predictions)

    #bestPipeline = ????
    #bestLRModel = ????
    #bestParams = ????

    print("Root Mean Squared Error: %.2f" % rmse2)
    (1) Spark Jobs 
    (2) Root Mean Squared Error: 3.60     # the same ¿?

Solution

You need to call .fit() with your training data on the crossval object to create the cv model. That will do the cross validation. Then you get the best model (according to your evaluator metric) from that. Eg.

cvModel = crossval.fit(trainingData) myBestModel = cvModel.bestModel