We have the following CSV file:
RWA
Country of exposure
Credit risk asset class
Projection period
Scenario
RWA
RWA in the first line is the header. The last line happens to be equal to the header, but it is a data row, not a header. When the CSV contents are loaded from a Dataset[String] like this:
import org.apache.spark.sql.Dataset
import spark.implicits._
val source: Array[String] = (
"RWA\n" +
"Country of exposure\n" +
"Credit risk asset class\n" +
"Projection period\n" +
"Scenario\n" +
"RWA"
).split("\n")
val csvData: Dataset[String] = spark.sparkContext.parallelize(source).toDS()
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", value = true)
.csv(csvData)
// df.count() == 4 unexpectedly
The expected df.count() is 5, but it is actually 4.
If the same CSV is loaded from a file, the problem doesn't occur:
import java.nio.file.Files
val tempFile = Files.createTempFile("tmp", ".csv")
val res = "RWA\n" +
"Country of exposure\n" +
"Credit risk asset class\n" +
"Projection period\n" +
"Scenario\n" +
"RWA"
Files.writeString(tempFile, res)
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", value = true)
.csv(tempFile.toString)
// df.count() == 5 as expected
Is there a way to tell Spark to behave the same way when loading from a Dataset?
Scala version: 2.12.14
Spark version: 3.0.3
It is this function that filters out the last row: org.apache.spark.sql.execution.datasources.csv.CSVUtils#filterHeaderLine
If you set the header option to true, this function removes every row that is equal to the first line, i.e. the header.
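A simplified, self-contained sketch of that filtering logic (the real method lives in Spark's CSVUtils and also takes a CSVOptions argument; the object and parameter names below are illustrative):

```scala
// Simplified model of Spark's CSV header filtering: with header=true,
// every line equal to the first (header) line is dropped, not just the first one.
object FilterHeaderSketch {
  def filterHeaderLine(iter: Iterator[String], firstLine: String, headerFlag: Boolean): Iterator[String] =
    if (headerFlag) iter.filterNot(_ == firstLine) else iter

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "RWA",                     // header
      "Country of exposure",
      "Credit risk asset class",
      "Projection period",
      "Scenario",
      "RWA"                      // data row that happens to equal the header
    )
    val rows = filterHeaderLine(lines.iterator, lines.head, headerFlag = true).toSeq
    // Both "RWA" lines are removed, leaving 4 rows instead of the expected 5.
    println(rows.size) // 4
  }
}
```

This reproduces the observed count of 4: the trailing "RWA" data row is indistinguishable from the header by string comparison alone.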
Why does it behave this way? Save a DataFrame with this snippet and inspect the resulting files:
df.repartition(2).write.option("header", true).csv("output.csv")
Now try to read output.csv back. The reason for this behavior is that when Spark reads a CSV path, it may actually be a directory of multiple part files, each with its own header row, so header rows can appear anywhere in the combined data.
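To see why, consider a plain-Scala sketch of two hypothetical part files, each written with its own header, as produced by the repartition(2) write above (the file contents here are made up for illustration):

```scala
// Hypothetical contents of two part files, each starting with its own header line.
val part0 = Seq("RWA", "Country of exposure", "Projection period")
val part1 = Seq("RWA", "Credit risk asset class", "Scenario")

// When reading, Spark concatenates the partitions and must drop the header
// from each file. Since it cannot tell a repeated header apart from a data
// row containing the same text, it drops every line equal to the header.
val all    = part0 ++ part1
val header = all.head
val data   = all.filterNot(_ == header)
println(data) // List(Country of exposure, Projection period, Credit risk asset class, Scenario)
```

This is exactly the filtering that removes the trailing "RWA" data row in the Dataset[String] case, where there is only one "file" and the repeated-header assumption doesn't hold.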