machine-learning, pyspark, linear-regression, apache-spark-mllib, apache-spark-ml

ParamGridBuilder in PySpark does not work with LinearRegressionSGD


I'm trying to figure out why LinearRegressionWithSGD does not work with Spark's ParamGridBuilder. From the Spark documentation:

lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

However, simply changing LinearRegression to LinearRegressionWithSGD does not work, and the SGD-specific parameters (such as iterations or miniBatchFraction) cannot be passed in either.

Thanks!!


Solution

  • That is because you are trying to mix functionality from two different libraries: LinearRegressionWithSGD comes from pyspark.mllib (the old, RDD-based API), while both LinearRegression and ParamGridBuilder come from pyspark.ml (the new, DataFrame-based API).

    Indeed, a few lines before the code snippet in the documentation you quote (BTW, in the future it would be good to provide a link, too) you'll find the line:

    from pyspark.ml.regression import LinearRegression
    

    while for LinearRegressionWithSGD you have used something like:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
    

    These two libraries are not compatible: pyspark.mllib takes RDDs of LabeledPoint as input, which are not compatible with the DataFrames used in pyspark.ml; and since ParamGridBuilder is part of the latter, it can only be used with DataFrames, and not with the algorithms included in pyspark.mllib (check the documentation links provided above).

    Moreover, keep in mind that LinearRegressionWithSGD has been deprecated since Spark 2.0; from its documentation:

    Note: Deprecated in 2.0.0. Use ml.regression.LinearRegression or LBFGS.

    UPDATE: Thanks to @rvisio's comment below, we now know that, although undocumented, one can actually use solver='sgd' with LinearRegression in pyspark.ml; here is a short example adapted from the docs:

    spark.version
    # u'2.2.0'
    
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression
    
    df = spark.createDataFrame([
         (1.0, 2.0, Vectors.dense(1.0)),
         (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
    lr = LinearRegression(maxIter=5, regParam=0.0, solver="sgd", weightCol="weight") # solver='sgd'
    model = lr.fit(df) # works OK
    lr.getSolver()
    # 'sgd'