Tags: apache-spark, apache-spark-sql, databricks, spark-csv

count() throws java.lang.NumberFormatException: null on a file loaded from object storage with inferSchema enabled


The count() on a DataFrame loaded from IBM Bluemix object storage throws the following exception when inferSchema is enabled:

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 43.0 failed 10 times, most recent failure: Lost task 3.9 in stage 43.0 (TID 166, yp-spark-dal09-env5-0034): java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:554)
    at java.lang.Integer.parseInt(Integer.java:627)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)

I don't get the above exception if I disable inferSchema. Why am I getting this exception? And by default, how many rows does the Databricks spark-csv reader scan when inferSchema is enabled?
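
For context, here is a minimal sketch of the kind of read that triggers this (the container and file names are placeholders, not my actual code; Bluemix object storage is addressed via a swift:// URL):

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true") // inference is what triggers the failure
      .csv("swift://myContainer.keystone/myFile.csv")

    df.count() // throws java.lang.NumberFormatException: null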


Solution

  • This was actually an issue with the spark-csv package (null value still not correctly parsed #192) that was carried over into Spark 2.0. It has been fixed in Spark 2.1.

    Here is the associated PR: [SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens.

    Since you are already using Spark 2.0, you can easily upgrade to 2.1 and drop the spark-csv package; it's not needed anyway, as CSV support is built into Spark 2.x. If you can't upgrade right away, see the workaround sketch below.
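
    A workaround on 2.0 that matches your own observation is to read with inference off (every column then comes back as a string, so the failing Integer.parseInt path is never reached) and cast the columns yourself afterwards. A minimal sketch; the column names and types are placeholders for your file's actual layout:

        import org.apache.spark.sql.functions.col

        // Inference off: all columns arrive as strings, so the CSV parser
        // never calls Integer.parseInt on a null token.
        val raw = spark.read
          .option("header", "true")
          .csv("swift://myContainer.keystone/myFile.csv")

        // Cast explicitly; Spark's cast yields null on bad or missing
        // values instead of throwing. Column names are placeholders.
        val df = raw
          .withColumn("id", col("id").cast("int"))
          .withColumn("score", col("score").cast("double"))

        df.count() // completes without NumberFormatException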