Tags: pyspark, modeling, cross-validation, apache-spark-mllib, apache-spark-ml

How to extract model hyper-parameters from spark.ml in PySpark?


I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)

Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't find the value of lr.regParam that the cross-validation procedure selected. Any ideas?

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []

Solution

  • I ran into this problem as well. For reasons I don't understand, the Python wrapper of the fitted model does not expose these values, so you have to read them from the underlying Java object (_java_obj). For example:

    from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator
    
    evaluator = RegressionEvaluator(metricName="mae")
    lr = LinearRegression()
    grid = (ParamGridBuilder()
            .addGrid(lr.maxIter, [500])
            .addGrid(lr.regParam, [0])
            .addGrid(lr.elasticNetParam, [1])
            .build())
    lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                           evaluator=evaluator, numFolds=3)
    lrModel = lr_cv.fit(your_training_set_here)
    bestModel = lrModel.bestModel
    

    Printing out the parameters you want:

    >>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
    0
    >>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
    500
    >>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
    1
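
The three calls above follow one pattern, so they can be wrapped in a small helper. This is only a sketch: java_param_values is a hypothetical name, and it assumes the model exposes _java_obj and that the Java object has a getXxx() method for every parameter name you pass in.

```python
def java_param_values(model, names):
    """Read hyper-parameter values from the Java object behind a fitted model.

    Assumption: `model._java_obj` exists and has a `getXxx()` getter for
    each parameter name `xxx` in `names` (e.g. "regParam" -> getRegParam()).
    """
    values = {}
    for name in names:
        # Build the getter name: "regParam" -> "getRegParam".
        getter = "get" + name[0].upper() + name[1:]
        values[name] = getattr(model._java_obj, getter)()
    return values
```

With the bestModel from above, java_param_values(bestModel, ["regParam", "maxIter", "elasticNetParam"]) would return the three values as a dict.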
    

    The same workaround applies to other methods such as extractParamMap(), which also come back empty on the Python side. This looks like a bug, so hopefully it gets fixed soon.
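
A different route, not the one used above, is to pair each parameter map in the grid with its average cross-validation metric and pick the best pair yourself. The sketch below is plain Python; best_param_map is a hypothetical helper, and the metric values are made up for illustration.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the (param_map, metric) pair with the best average metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("expected exactly one metric per parameter map")
    if not param_maps:
        raise ValueError("the parameter grid is empty")
    pairs = list(zip(param_maps, avg_metrics))
    pick = max if larger_is_better else min
    # On ties, max/min keep the first candidate in grid order.
    return pick(pairs, key=lambda pair: pair[1])


# Stand-in data: three grid points and made-up average metrics.
grid = [{"regParam": 0.1}, {"regParam": 0.01}, {"regParam": 0.001}]
metrics = [0.71, 0.78, 0.74]
print(best_param_map(grid, metrics))  # ({'regParam': 0.01}, 0.78)
```

With the question's objects this would be called as best_param_map(cv.getEstimatorParamMaps(), cvModel.avgMetrics, evaluator.isLargerBetter()), assuming your PySpark version exposes avgMetrics on the fitted CrossValidatorModel. The returned map is keyed by Param objects, so something like {p.name: v for p, v in params.items()} gives readable names.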