Tags: scala, apache-spark, spark-streaming, spark-structured-streaming, aws-databricks

Read CSV with "§" as delimiter using Databricks autoloader


I'm very new to Spark Streaming and Auto Loader, and I have a question: how can we get Auto Loader to read a text file that uses "§" as the delimiter? Below, I tried reading the file as a CSV.

I tried running the following:

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")

but it did not work; I got this output: Image1

Figuring it might be an encoding-related issue, I tried running the below:

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("encoding", "Cp1252") //ANSI  
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")

This time I'm able to see the "§" in the output, but the delimiter still isn't applied, as shown here.

Please help!

Edit - I have tried replacing "§" with its Unicode code point, U+00A7, and it still doesn't work.
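As an aside, a likely explanation for why the encoding option changes what you see (my own sketch, not from the original post): "§" is a single byte, 0xA7, in the ANSI (Windows-1252) code page, but two bytes, 0xC2 0xA7, in UTF-8, which Spark uses by default. If the file was saved as ANSI and decoded as UTF-8, the lone 0xA7 byte is invalid and gets replaced, so the delimiter character never matches. The plain-Scala snippet below (no Spark required; the object name is hypothetical) illustrates the mismatch:

```scala
// Plain-Scala illustration of the charset mismatch behind the "§" issue.
// No Spark needed; this only inspects how the character is encoded.
object DelimiterEncodingDemo {
  def main(args: Array[String]): Unit = {
    // "§" as a Windows-1252 (ANSI) byte: a single 0xA7
    val ansiBytes = "§".getBytes("windows-1252")
    println(ansiBytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")) // 0xA7

    // "§" in UTF-8 (Spark's default charset): two bytes, 0xC2 0xA7
    val utf8Bytes = "§".getBytes("UTF-8")
    println(utf8Bytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")) // 0xC2 0xA7

    // Decoding the ANSI byte as UTF-8 mangles the character, so a
    // "§" delimiter can never match in the decoded text.
    val misdecoded = new String(ansiBytes, "UTF-8")
    println(misdecoded.contains("§")) // false
  }
}
```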


Solution

  • At first glance it looks like the order of the .option() calls we pass into Auto Loader matters, but compare the two snippets closely: they also differ in the spelling of the delimiter option.

    For example, the below works:

    val df2 = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("encoding", "Cp1252") // ANSI
      .option("delimiter", "§")
      .option("header", "false")
      .schema(schema)
      .load(file_path)


    The below does not:

    val df3 = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimeter", "§") // note the misspelled key
      .option("encoding", "Cp1252") // ANSI
      .option("header", "false")
      .schema(schema)
      .load(file_path)


    Spark collects reader options into a case-insensitive map, so the order in which they are set does not actually matter. What does matter is that unrecognized option keys are silently ignored: the misspelled "delimeter" never takes effect, and the reader falls back to the default comma. The working snippet spells the key "delimiter" and sets "encoding" to Cp1252 so the ANSI-encoded "§" byte is decoded correctly.
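The combined effect of the two corrected options can be sanity-checked outside Spark. The minimal plain-Scala sketch below (the object name and sample line are my own, hypothetical) mimics what the working configuration does per line: decode the raw Windows-1252 bytes first, then split on "§":

```scala
// Plain-Scala sketch of what the working reader configuration does per
// line: decode with the right charset, then split on the delimiter.
// The sample content is hypothetical.
object SplitAfterDecode {
  def main(args: Array[String]): Unit = {
    // "alice§30§london" as Windows-1252 bytes ("§" is the single byte 0xA7)
    val rawLine: Array[Byte] = "alice§30§london".getBytes("windows-1252")

    // Step 1: decode with the ANSI code page,
    // as .option("encoding", "Cp1252") does
    val decoded = new String(rawLine, "windows-1252")

    // Step 2: split on the section sign,
    // as .option("delimiter", "§") does
    val fields = decoded.split("§")
    println(fields.mkString(" | ")) // alice | 30 | london
  }
}
```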