pyspark, apache-spark-mllib

How to get the best hyperparameter value after cross-validation in PySpark?


I am running cross-validation on my dataset over a grid of hyperparameters.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
    .build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator)

I want to know the best values for regParam and elasticNetParam. In Python's scikit-learn we can get the best parameters after cross-validation. Is there any method in PySpark to get the best parameter values after cross-validation?

For example: regParam - 0.05
             elasticNetParam - 0.1
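
For comparison, this is what I have in mind from scikit-learn (a toy sketch, not my actual code; the data and grid here are made up for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = [[0.0], [1.0], [2.0], [3.0]]   # toy feature matrix
y = [0, 0, 1, 1]                   # toy labels
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1.0]}, cv=2)
search.fit(X, y)
print(search.best_params_)         # dict of the winning hyperparameters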

Solution

  • Well, you have to fit your CrossValidator first:

    cv_model = cv.fit(train_data)
    

    After you do that, the best model will be available as:

    best_model = cv_model.bestModel
    

    To extract the parameters, you will have to do this ugly thing:

    best_reg_param = best_model._java_obj.getRegParam()
    best_elasticnet_param = best_model._java_obj.getElasticNetParam()
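

    If you want to avoid reaching into _java_obj, one alternative is to pair the parameter grid with cv_model.avgMetrics, which holds the average cross-validation score for each ParamMap in the same order as the grid. This is a sketch assuming the paramGrid and cv_model defined above, and a metric where larger is better (BinaryClassificationEvaluator's default areaUnderROC):

    # avgMetrics is aligned with the param grid passed to the CrossValidator
    best_params, best_score = max(
        zip(paramGrid, cv_model.avgMetrics),
        key=lambda pair: pair[1],  # larger is better for areaUnderROC
    )
    for param, value in best_params.items():
        print(param.name, value)   # prints each tuned parameter and its best value

    Depending on your PySpark version, the tuned values may also be copied onto the Python model, in which case best_model.getRegParam() and best_model.getElasticNetParam() work without going through _java_obj.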