Search code examples
scalaapache-sparkapache-spark-sqlapache-spark-dataset

Apache Spark: Issues with Extracting Values from Row


I've been having a lot of issues with the Row class in Spark. It seems to me Row class is one real badly designed class. It should really be no more difficult to extract a value from a Row than extracting a value from a Scala list; but in practice, you have to know the exact type of the column in order to extract it. You can't even turn the columns into String; how ridiculous is that for a great framework such as Spark? In real world, in most cases, you don't know the exact type of the column, and on top of that in many cases, you have dozens or hundreds of columns. The following is an example to show you the ClassCastExceptions that I've been getting.

Does anyone have any solution for easy extraction of values from a Row?

scala> val df = List((1,2),(3,4)).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: int]


scala> df.first.getAs[String]("col1")
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
  ... 56 elided

scala> df.first.getAs[Int]("col1")
res12: Int = 1

scala> df.first.getInt(0)
res13: Int = 1

scala> df.first.getLong(0)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
  at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:105)
  at org.apache.spark.sql.Row$class.getLong(Row.scala:231)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getLong(rows.scala:165)
  ... 56 elided

scala> df.first.getFloat(0)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Float
  at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:109)
  at org.apache.spark.sql.Row$class.getFloat(Row.scala:240)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getFloat(rows.scala:165)
  ... 56 elided

scala> df.first.getString(0)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
  at org.apache.spark.sql.Row$class.getString(Row.scala:255)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:165)
  ... 56 elided 

Solution

  • Spark is an open source project. You can modify the apis if you don't like them. Don't take it negative just because you don't get what you desire. There are plenty of alternatives. And Spark has been made as much flexible as possible.

    Alternatively you can do the followings

    df.first.get(0).toString
    //res0: String = 1
    df.first.get(0).toString.toLong
    //res1: Long = 1
    df.first.get(0).toString.toFloat
    //res2: Float = 1.0
    

    And

    df.first.getAs[Int]("col1").toString
    //res0: String = 1
    df.first.getAs[Int]("col1").toLong
    //res1: Long = 1
    df.first.getAs[Int]("col1").toFloat
    //res2: Float = 1.0
    

    I repeat again, You can always extend the existing apis and implement yours or create your own if you are not satisfied with provided apis