pyspark ML LabeledPoint not working with LinearRegression


I'm studying Spark 3.0.1 with PySpark, and have set up some data for a simple OLS regression using

data = results.select('OrderMonthYear', 'SaleAmount').rdd.map(lambda row: LabeledPoint(row[1], [row[0]])).toDF()

OrderMonthYear is my feature column (int), and SaleAmount is the response (float). LabeledPoint was imported from pyspark.mllib.regression. I then try to fit the regression model with

from pyspark.ml.regression import LinearRegression
lr = LinearRegression()
modelA = lr.fit(data, {lr.regParam:0.0})

but this raises the following exception:

IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.

This is not very helpful, since the required type and the actual type look like identical structs. Searching online, I only found answers to this problem for Java, or for cases where someone was building the struct by hand. The exception is raised from a PySpark utility function that just re-raises the Java exception (its source comment reads "# Hide where the exception came from that shows a non-Pythonic JVM exception message."), so I can't debug any further.


Solution

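  • LabeledPoint comes from the RDD-based pyspark.mllib API, which is in maintenance mode, and it stores its features as pyspark.mllib.linalg vectors. LinearRegression from pyspark.ml expects pyspark.ml.linalg vectors instead. The two vector types are distinct but share the same underlying struct layout, which is why the "required" and "actual" types in the message look identical. I suggest building the features column with VectorAssembler from pyspark.ml: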

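    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    
    # Toy data standing in for `results`
    data = spark.createDataFrame([[0, 1.0], [1, 2.0], [2, 3.0]]).toDF('OrderMonthYear', 'SaleAmount')
    
    # Pack the feature column into a pyspark.ml vector column named 'features'
    va = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features')
    data2 = va.transform(data)
    
    # SaleAmount is the response, so it is the label column
    lr = LinearRegression(labelCol='SaleAmount')
    model = lr.fit(data2)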