
How to make Spark not lose the last row of a CSV file when loading from a Dataset[String]?


We have the following CSV file:

RWA
Country of exposure
Credit risk asset class
Projection period
Scenario
RWA

RWA in the first line is the header. The last line is identical to the header, but it is not a header. When the CSV contents are loaded from a Dataset[String] like this:

import org.apache.spark.sql.Dataset
import spark.implicits._

val source: Array[String] = (
      "RWA\n" +
      "Country of exposure\n" +
      "Credit risk asset class\n" +
      "Projection period\n" +
      "Scenario\n" +
      "RWA"
      ).split("\n")
val csvData: Dataset[String] = spark.sparkContext.parallelize(source).toDS()
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", value = true)
  .csv(csvData)
// df.count() == 4 unexpectedly

The expected df.count() is 5, but it is actually 4.

If the same CSV is loaded from a file, this problem doesn't happen:

import java.nio.file.Files

val tempFile = Files.createTempFile("tmp", ".csv")
val res = "RWA\n" +
      "Country of exposure\n" +
      "Credit risk asset class\n" +
      "Projection period\n" +
      "Scenario\n" +
      "RWA"
Files.writeString(tempFile, res)
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", value = true)
  .csv(tempFile.toString)
// df.count() == 5 as expected

Is there a way to tell Spark to act the same way when loading from a Dataset?

Scala version: 2.12.14

Spark version: 3.0.3


Solution

  • It is this function that filters out the last row: org.apache.spark.sql.execution.datasources.csv.CSVUtils#filterHeaderLine

    If you set the header option to true, this function removes every row that is equal to the first line, i.e. the header (a simplified sketch of this behavior follows below).

    You may wonder why?! Use this snippet to save a DataFrame, then investigate the resulting files: df.repartition(2).write.option("header", true).csv("output.csv")

    Now, try to read output.csv back. The reason behind this behavior is that when Spark reads a CSV path, it may actually point to a directory containing multiple part files, each with its own header row, so every line equal to the header has to be dropped! A possible workaround for the Dataset[String] case is sketched further below.
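
A minimal sketch of that filtering behavior (simplified for illustration, not the actual Spark source, which also consults the parser options):

def filterHeaderLine(lines: Iterator[String], headerLine: String, header: Boolean): Iterator[String] =
  // every line equal to the header is dropped, not just the first occurrence
  if (header) lines.filterNot(_ == headerLine) else lines

val lines = Iterator("RWA", "Country of exposure", "Credit risk asset class",
  "Projection period", "Scenario", "RWA")
// prints 4: the first "RWA" (header) and the last "RWA" (data) are both dropped
println(filterHeaderLine(lines, "RWA", header = true).size)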
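
To actually keep the last row when loading from a Dataset[String], one possible workaround (a sketch, assuming the single-column layout from the question, and that csvData preserves the original line order, as it does for the parallelize call above) is to drop the header by position rather than by value, and read the rest with the header option disabled:

import spark.implicits._

// Drop only the first physical line (by index, not by value), so a data row
// that happens to equal the header survives.
val withoutHeader = csvData.rdd
  .zipWithIndex()
  .filter { case (_, idx) => idx != 0L }
  .map { case (line, _) => line }
  .toDS()

val df = spark.read
  .option("header", value = false)
  .csv(withoutHeader)
  .toDF("RWA") // reapply the column name manually
// df.count() == 5 as expected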