I'm studying Spark 3.0.1 with pyspark, and have setup some data for simple OLS regression using
data = results.select('OrderMonthYear', 'SaleAmount').rdd.map(lambda row: LabeledPoint(row[1], [row[0]])).toDF()
The OrderMonthYear
is my feature column (int), and SaleAmount
is the response (float). The LabeledPoint
method was imported from pyspark.mllib.regression
. I then try to fit the regression model with
from pyspark.ml.regression import LinearRegression
lr = LinearRegression()
modelA = lr.fit(data, {lr.regParam:0.0})
to get this exception
IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
This is clearly not very helpful, as the required and passed features seem to be the same structs. I've searched online, and only found answers to this problem for java, or for someone building the struct themselves. The exception was thrown from a util function that was just throwing a java exception (#Hide where the exception came from that shows a non-Pythonic JVM exception message.
), so I can't debug further.
MLlib and RDD-based MLlib functions are deprecated. I suggest using vector assembler of ML:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
data = spark.createDataFrame([[0,1],[1,2],[2,3]]).toDF('OrderMonthYear', 'SaleAmount')
va = VectorAssembler(inputCols=['SaleAmount'], outputCol='features')
data2 = va.transform(data)
lr = LinearRegression(labelCol='OrderMonthYear')
model = lr.fit(data2)