I'm trying to figure out why LinearRegressionWithSGD
does not work with Spark's ParamGridBuilder
. From the Spark documentation:
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.1, 0.01]) \
.addGrid(lr.fitIntercept, [False, True])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.build()
However, changing LinearRegression
to LinearRegressionWithSGD
simply does not work. Subsequently SGD parameters are also unable to be passed in (such as iterations or minibatchfraction).
Thanks!!
That is because you are trying to mix functionality from two different libraries: LinearRegressionWithSGD
comes from pyspark.mllib
(i.e. the old, RDD-based API), while both LinearRegression
& ParamGridBuilder
come from pyspark.ml
(the new, dataframe-based API).
Indeed, a few lines before the code snippet in the documentation you quote (BTW, in the future it would be good to provide a link, too) you'll find the line:
from pyspark.ml.regression import LinearRegression
while for LinearRegressionWithSGD
you have used something like:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
These two libraries are not compatible: pyspark.mllib
takes RDD's of LabeledPoint
as input, which is not compatible with the dataframes used in pyspark.ml
; and since ParamGridBuilder
is part of the latter, it can only be used with dataframes, and not with algorithms included in pyspark.mllib
(check the documentation links provided above).
Moreover, keep in mind that LinearRegressionWithSGD
is deprecated in Spark 2:
Note: Deprecated in 2.0.0. Use ml.classification.LogisticRegression or LogisticRegressionWithLBFGS.
UPDATE: Thanks to @rvisio's comment below, we know now that, although undocumented, one can actually use solver='sgd'
for LinearRegression
in pyspark.ml
; here is a short example adapted from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
df = spark.createDataFrame([
(1.0, 2.0, Vectors.dense(1.0)),
(0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
lr = LinearRegression(maxIter=5, regParam=0.0, solver="sgd", weightCol="weight") # solver='sgd'
model = lr.fit(df) # works OK
lr.getSolver()
# 'sgd'