
Java Spark ML - java.lang.IllegalArgumentException: label does not exist. Available:


I have a quick question about a Spark exception I am getting.

I have a very straightforward dataset:

myCoolDataset.show();
        +----------+-----+
        |      time|value|
        +----------+-----+
        |1621900800|   43|
        |1619568000|   41|
        |1620432000|   41|
        |1623974400|   42|
        |1620604800|   41|
      [truncated]
        |1621296000|   42|
        |1620691200|   44|
        |1620345600|   41|
        |1625702400|   44|
        +----------+-----+
        only showing top 20 rows


I would like to run a linear regression on it in order to predict the value at a future time.

This is what I tried:

        VectorAssembler       vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time", "value"}).setOutputCol("features");
        Dataset<Row>          vectorData      = vectorAssembler.transform(myCoolDataset);
        LinearRegression      lr              = new LinearRegression();
        LinearRegressionModel lrModel         = lr.fit(vectorData); // issue here

Unfortunately, at run time, I am getting this exception:

Exception in thread "main" java.lang.IllegalArgumentException: label does not exist. Available: time, value, features
        at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
        at scala.collection.immutable.Map$Map3.getOrElse(Map.scala:181)
        at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
        at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:75)
        at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
        at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
        at org.apache.spark.ml.regression.LinearRegression.org$apache$spark$ml$regression$LinearRegressionParams$$super$validateAndTransformSchema(LinearRegression.scala:185)

What is the root cause, and how can I fix it?

Thank you


Solution

  • MLlib regressions need to know which column contains the label (the value you want to predict). By default, they look for a column named 'label'. Your dataset has no such column: the exception lists the available columns as time, value and features.

    There are three ways to fix this:

    • call setLabelCol("value") on your LinearRegression instance;
    • rename column 'value' to 'label' in your dataset (method withColumnRenamed);
    • copy column 'value' into a new column named 'label' (method withColumn).
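
The three options can be sketched as follows. This is a minimal illustration, not the asker's actual program: the class name LabelColumnExample, the helper method names, and the local SparkSession setup are assumptions made for the example, and the rows are a handful copied from the table in the question. The sketch also assembles only "time" into the features vector, on the assumption that the label should not be one of its own predictors; the original code included "value" as well.

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
public class LabelColumnExample {

    // Option 1: tell the estimator which existing column holds the label.
    static LinearRegression withExplicitLabel() {
        return new LinearRegression().setLabelCol("value");
    }

    // Option 2: rename "value" to "label", the default label column name.
    static Dataset<Row> renameToLabel(Dataset<Row> data) {
        return data.withColumnRenamed("value", "label");
    }

    // Option 3: keep "value" and add a copy of it named "label".
    static Dataset<Row> copyToLabel(Dataset<Row> data) {
        return data.withColumn("label", data.col("value"));
    }

    public static void main(String[] args) {
        // Assumption: a local session is enough for this small example.
        SparkSession spark = SparkSession.builder()
                .appName("LabelColumnExample")
                .master("local[*]")
                .getOrCreate();

        // A few rows taken from the table shown in the question.
        StructType schema = new StructType()
                .add("time", DataTypes.LongType)
                .add("value", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create(1621900800L, 43),
                RowFactory.create(1619568000L, 41),
                RowFactory.create(1620432000L, 41),
                RowFactory.create(1623974400L, 42),
                RowFactory.create(1620604800L, 41));
        Dataset<Row> myCoolDataset = spark.createDataFrame(rows, schema);

        // Only "time" goes into the features vector (assumption, see lead-in).
        VectorAssembler vectorAssembler = new VectorAssembler()
                .setInputCols(new String[]{"time"})
                .setOutputCol("features");
        Dataset<Row> vectorData = vectorAssembler.transform(myCoolDataset);

        // Option 1 in action: fit() now finds the label in "value".
        LinearRegressionModel lrModel = withExplicitLabel().fit(vectorData);
        System.out.println("coefficients: " + lrModel.coefficients()
                + ", intercept: " + lrModel.intercept());

        // Options 2 and 3 work with a default LinearRegression instead.
        new LinearRegression().fit(renameToLabel(vectorData));
        new LinearRegression().fit(copyToLabel(vectorData));

        spark.stop();
    }
}
```

Option 1 leaves the dataset untouched, which is usually the least intrusive choice. Options 2 and 3 are useful when several estimators should share the default column names.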