scala apache-spark java-11 spark3

Spark 3.0 is much slower to read json files than Spark 2.4


I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there a configuration problem with Spark 3.0?

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0

scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349



Solution

  • As it turns out, the default behavior of Spark 3.0 has changed: it tries to infer timestamps unless a schema is specified, and that results in a huge amount of text scanning. When I loaded the data with inferTimestamp=false, the time came close to that of Spark 2.4, but Spark 2.4 still beats Spark 3.0 by ~3+ sec (maybe within an acceptable range, but the question is why?). I have no idea why this behavior was changed, but it should have been announced in BOLD letters.

    Spark 2.4

    spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
    Time taken: 29706 ms
    res0: Long = 2605349
    
    
    
    spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
    Time taken: 31431 ms
    res0: Long = 2605349
    

    Spark 3.0

    spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
    Time taken: 32826 ms
    res0: Long = 2605349
     
    spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
    Time taken: 34011 ms
    res0: Long = 2605349
    

    Note:

    • Never set prefersDecimal to true, even when inferTimestamp is false; it again takes a huge amount of time.
    • Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec.
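
Since the timestamp inference only happens when no schema is specified, another option is to pass the schema explicitly so that Spark does not have to infer it at all. The following is a minimal sketch, not a tested fix for this data set. It assumes the two fields visible in the REPL output above (created: bigint, id: string); the other five fields are elided there, so they are left out here and would need to be added by hand. Columns not listed in the schema will not appear in the resulting DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// In spark-shell this returns the existing session.
val spark = SparkSession.builder().appName("json-explicit-schema").getOrCreate()

// ASSUMPTION: only `created` and `id` are known from the output above.
// The remaining five fields are not shown in the question; add them for real use.
val schema = StructType(Seq(
  StructField("created", LongType, nullable = true),
  StructField("id", StringType, nullable = true)
))

// With a user-supplied schema, Spark does not run schema inference
// over the files when the DataFrame is created.
val df = spark.read.schema(schema).json("/data/20200528")

// Time the actual read, as in the measurements above.
spark.time(df.count())
```

With this approach, neither inferTimestamp nor prefersDecimal should affect load time, because both options only influence schema inference. Whether it closes the remaining gap to Spark 2.4 would need to be measured on the same data.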