apache-spark, hadoop-yarn, apache-spark-sql

Spark DataFrame losing string data in yarn-client mode


For some reason, if I add a new column, append a string to existing data/columns, or create a new DataFrame from code, the string data gets misinterpreted, so show() doesn't work properly and filters (such as withColumn, where, when, etc.) don't work either.

Here is example code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object MissingValue {
  // Print a string's UTF-8 bytes in hex so any corruption is visible
  def hex(str: String): String =
    str.getBytes("UTF-8").map(f => Integer.toHexString(f & 0xFF).toUpperCase).mkString("-")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MissingValue")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val list = List((101, "ABC"), (102, "BCD"), (103, "CDE"))
    val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
    val schema = StructType(StructField("COL1", IntegerType, true) :: StructField("COL2", StringType, true) :: Nil)
    val df = sqlContext.createDataFrame(rdd, schema)
    df.show()

    val str = df.first().getString(1)
    println(s"${str} == ${hex(str)}")

    sc.stop()
  }
}

If I run it in local mode then everything works as expected:

+----+----+
|COL1|COL2|
+----+----+
| 101| ABC|
| 102| BCD|
| 103| CDE|
+----+----+

ABC == 41-42-43

But when I run the same code in yarn-client mode it produces:

+----+----+
|COL1|COL2|
+----+----+
| 101| ^E^@^@|
| 102| ^E^@^@|
| 103| ^E^@^@|
+----+----+

^E^@^@ == 5-0-0

This problem exists only for string values; the first column (Integer) is fine.

Also, if I create an RDD from the DataFrame then everything is fine, i.e. df.rdd.take(1).apply(0).getString(1) returns the correct value.
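
A minimal sketch of the two access paths side by side, building on the df and the hex helper from the example above; only the DataFrame path returns the corrupted bytes in yarn-client mode:

val viaDataFrame = df.first().getString(1)          // corrupted in yarn-client mode ("^E^@^@")
val viaRdd = df.rdd.take(1).apply(0).getString(1)   // correct in both modes ("ABC")
println(s"DataFrame: ${hex(viaDataFrame)} | RDD: ${hex(viaRdd)}")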

I'm using Spark 1.5.0 from CDH 5.5.2.

EDIT: It seems this happens when the difference between driver memory and executor memory is too high (--driver-memory xxG --executor-memory yyG), i.e. when I decrease the executor memory or increase the driver memory, the problem disappears.
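
As a sketch of the same adjustment from code (the 4g figure is only a placeholder, not a recommendation): in yarn-client mode the executors are launched after the SparkContext starts, so spark.executor.memory can also be set in SparkConf rather than on the command line:

val conf = new SparkConf()
  .setAppName("MissingValue")
  // placeholder size; the point is to keep it close to the driver heap
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)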


Solution

  • This is a bug related to executor memory and the JVM's ordinary object pointer (oops) size: the JVM disables compressed oops for heaps above roughly 32 GB, so a driver and an executor with very different heap sizes can end up with different pointer sizes, and rows written by one JVM are misread by the other:

    https://issues.apache.org/jira/browse/SPARK-9725
    https://issues.apache.org/jira/browse/SPARK-10914
    https://issues.apache.org/jira/browse/SPARK-17706

    It is fixed in Spark version 1.5.2
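
    If upgrading isn't immediately possible, keeping the driver and executor heaps on the same side of the ~32 GB compressed-oops threshold (as noted in the question's edit) avoids the mismatch. A minimal sketch of pinning the executor JVMs explicitly instead, assuming the driver heap is large enough that compressed oops are already off there (the driver JVM in yarn-client mode is already running, so its options would have to go through --driver-java-options or spark-defaults.conf rather than SparkConf):

        val conf = new SparkConf()
          .setAppName("MissingValue")
          // assumption: match the executors to a driver running without compressed oops
          .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops")
        val sc = new SparkContext(conf)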