Search code examples
apache-sparkapache-spark-ml

LinearRegression scala.MatchError:


I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

Error is

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost): 
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Full code

The question is how I should use ParamGridBuilder in this case


Solution

  • Problem here is input schema not ParamGridBuilder. Price column is loaded as an integer while LinearRegression is expecting a double. You can fix it by explicitly casting column to required type:

    val houses = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(...)
      .withColumn("price", $"price".cast("double"))