java apache-spark apache-spark-mllib apache-spark-ml

Java Spark ML - java.lang.IllegalArgumentException: label does not exist. Available:

I am getting an exception from Spark ML that I do not understand.

I have a very straightforward dataset:

myCoolDataset.show();
        +----------+-----+
        |      time|value|
        +----------+-----+
        |1621900800|   43|
        |1619568000|   41|
        |1620432000|   41|
        |1623974400|   42|
        |1620604800|   41|
      [truncated]
        |1621296000|   42|
        |1620691200|   44|
        |1620345600|   41|
        |1625702400|   44|
        +----------+-----+
        only showing top 20 rows

I would like to fit a linear regression on it in order to predict the value at a future time.

This is what I tried:

        VectorAssembler vectorAssembler = new VectorAssembler()
                .setInputCols(new String[]{"time", "value"})
                .setOutputCol("features");
        Dataset<Row> vectorData = vectorAssembler.transform(myCoolDataset);
        LinearRegression lr = new LinearRegression();
        LinearRegressionModel lrModel = lr.fit(vectorData); // issue here

Unfortunately, at run time, I am getting this exception:

Exception in thread "main" java.lang.IllegalArgumentException: label does not exist. Available: time, value, features
        at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
        at scala.collection.immutable.Map$Map3.getOrElse(Map.scala:181)
        at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
        at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:75)
        at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
        at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
        at org.apache.spark.ml.regression.LinearRegression.org$apache$spark$ml$regression$LinearRegressionParams$$super$validateAndTransformSchema(LinearRegression.scala:185)

What is the root cause, and how can I fix it?

Thank you

Solution

Spark ML regressors need to know which column holds the label (the quantity you want to predict). By default they look for a column named 'label', and your dataset has no such column: the only columns available are time, value and features, which is exactly what the exception message lists.
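The default column names can be checked on the estimator itself. A minimal sketch, assuming Spark ML is on the classpath; the class name DefaultColumnNames is made up for illustration:

```java
import org.apache.spark.ml.regression.LinearRegression;

public class DefaultColumnNames {
    public static void main(String[] args) {
        LinearRegression lr = new LinearRegression();
        // Column names the estimator uses unless they are overridden.
        System.out.println(lr.getLabelCol());      // label
        System.out.println(lr.getFeaturesCol());   // features
        System.out.println(lr.getPredictionCol()); // prediction
    }
}
```

This is also why the assembler output was picked up without any configuration: it happened to be named "features", which matches the default, whereas nothing in the dataset is named "label".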

I see three possible solutions:

Call setLabelCol("value") on your LinearRegression instance.
Rename the 'value' column to 'label' in your dataset (method withColumnRenamed).
Copy the 'value' column into a new column named 'label' (method withColumn).
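The first option could look like the sketch below. It assumes a local SparkSession and rebuilds a few rows of the dataset from the question, with time assumed to be a long and value an int. The names LabelColumnFix, regressionOnValue and sampleData are hypothetical. One deliberate departure from the question's code: only "time" is assembled into the features vector, because putting "value" into both the features and the label would let the model read the answer from its own input.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class LabelColumnFix {

    // Option 1: tell the estimator that the label lives in the "value" column.
    static LinearRegression regressionOnValue() {
        return new LinearRegression().setLabelCol("value");
    }

    // A few rows copied from the question (column types are an assumption).
    static Dataset<Row> sampleData(SparkSession spark) {
        StructType schema = new StructType()
                .add("time", DataTypes.LongType)
                .add("value", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create(1621900800L, 43),
                RowFactory.create(1619568000L, 41),
                RowFactory.create(1620432000L, 41),
                RowFactory.create(1623974400L, 42),
                RowFactory.create(1620604800L, 41));
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("label-column-fix")
                .master("local[*]")
                .getOrCreate();

        // Only "time" goes into the features vector; "value" is the label.
        Dataset<Row> vectorData = new VectorAssembler()
                .setInputCols(new String[]{"time"})
                .setOutputCol("features")
                .transform(sampleData(spark));

        // fit() no longer looks for a column named "label".
        LinearRegressionModel lrModel = regressionOnValue().fit(vectorData);
        System.out.println("coefficients: " + lrModel.coefficients()
                + ", intercept: " + lrModel.intercept());

        spark.stop();
    }
}
```

The other two options leave the estimator at its defaults and change the data instead: vectorData.withColumnRenamed("value", "label") for the rename, or vectorData.withColumn("label", functions.col("value")) for the copy (functions being org.apache.spark.sql.functions). Setting the label column on the estimator is probably the least intrusive choice, since the dataset keeps its original schema.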