Tags: scala, apache-spark, spark-streaming, spark-structured-streaming, aws-databricks

Read CSV with "§" as delimiter using Databricks autoloader


I'm very new to Spark Streaming and Auto Loader, and I have a question: how can we get Auto Loader to read a text file that uses "§" as the delimiter? Below, I tried reading the file as a CSV.

I tried running the following:

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")

but it did not work; I got this output: Image1

Figuring it might be an encoding-related issue, I tried running the below:

val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("encoding", "Cp1252") //ANSI  
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")

This time I'm able to see the "§" in the output, but the delimiter still isn't applied, as shown here.

Please help!

Edit - I have tried replacing "§" with its Unicode code point, U+00A7, and it still doesn't work.
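As an aside, a likely explanation for why the encoding option changes what you see (my own sketch, not from the original post): "§" is a single byte, 0xA7, in the ANSI (Windows-1252) code page, but two bytes, 0xC2 0xA7, in UTF-8, which Spark uses by default. If the file was saved as ANSI and decoded as UTF-8, the lone 0xA7 byte is invalid and gets replaced, so the delimiter character never matches. The plain-Scala snippet below (no Spark required; the object name is hypothetical) illustrates the mismatch:

```scala
// Plain-Scala illustration of the charset mismatch behind the "§" issue.
// No Spark needed; this only inspects how the character is encoded.
object DelimiterEncodingDemo {
  def main(args: Array[String]): Unit = {
    // "§" as a Windows-1252 (ANSI) byte: a single 0xA7
    val ansiBytes = "§".getBytes("windows-1252")
    println(ansiBytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")) // 0xA7

    // "§" in UTF-8 (Spark's default charset): two bytes, 0xC2 0xA7
    val utf8Bytes = "§".getBytes("UTF-8")
    println(utf8Bytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")) // 0xC2 0xA7

    // Decoding the ANSI byte as UTF-8 mangles the character, so a
    // "§" delimiter can never match in the decoded text.
    val misdecoded = new String(ansiBytes, "UTF-8")
    println(misdecoded.contains("§")) // false
  }
}
```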


Solution

  • At first glance it looks like the order of the .option() calls we pass into Auto Loader matters, but compare the two snippets closely: they also differ in the spelling of the delimiter option.

    For example, the below works:

    val df2 = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("encoding", "Cp1252") // ANSI
      .option("delimiter", "§")
      .option("header", "false")
      .schema(schema)
      .load(file_path)


    The below does not:

    val df3 = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimeter", "§") // note the misspelled key
      .option("encoding", "Cp1252") // ANSI
      .option("header", "false")
      .schema(schema)
      .load(file_path)


    Spark collects reader options into a case-insensitive map, so the order in which they are set does not actually matter. What does matter is that unrecognized option keys are silently ignored: the misspelled "delimeter" never takes effect, and the reader falls back to the default comma. The working snippet spells the key "delimiter" and sets "encoding" to Cp1252 so the ANSI-encoded "§" byte is decoded correctly.
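The combined effect of the two corrected options can be sanity-checked outside Spark. The minimal plain-Scala sketch below (the object name and sample line are my own, hypothetical) mimics what the working configuration does per line: decode the raw Windows-1252 bytes first, then split on "§":

```scala
// Plain-Scala sketch of what the working reader configuration does per
// line: decode with the right charset, then split on the delimiter.
// The sample content is hypothetical.
object SplitAfterDecode {
  def main(args: Array[String]): Unit = {
    // "alice§30§london" as Windows-1252 bytes ("§" is the single byte 0xA7)
    val rawLine: Array[Byte] = "alice§30§london".getBytes("windows-1252")

    // Step 1: decode with the ANSI code page,
    // as .option("encoding", "Cp1252") does
    val decoded = new String(rawLine, "windows-1252")

    // Step 2: split on the section sign,
    // as .option("delimiter", "§") does
    val fields = decoded.split("§")
    println(fields.mkString(" | ")) // alice | 30 | london
  }
}
```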