I'm very new to Spark Streaming and Auto Loader and have a question about how to get Auto Loader to read a text file that uses "§" as the delimiter. I tried reading the file as a CSV by running the following:
val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")
This did not work and produced the following output: Image1
I figured it might be an encoding-related issue, so I tried running the following:
val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("encoding", "Cp1252") // ANSI
  .option("header", "false")
  .schema(schema)
  .load("path-to-the-csv-file")
This time I can see the "§" in the output, but the rows are still not split on the delimiter.
Please help!
Edit: I have also tried replacing "§" with its Unicode equivalent, "U+00A7", and it still doesn't work.
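As background for the encoding attempt above, here is a minimal Spark-free sketch of why "§" can show up garbled under the default UTF-8 decoding. It assumes the file is stored in Cp1252 (Windows-1252), which the second attempt suggests but does not prove. The object name SectionSignEncoding and its members are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object SectionSignEncoding {
  // "Cp1252" is an alias Java accepts for the windows-1252 charset.
  private val cp1252: Charset = Charset.forName("windows-1252")

  // "§" is U+00A7; written as an escape so this source file's own encoding does not matter.
  val sectionSign: String = "\u00A7"

  // In Cp1252 the character is a single byte (0xA7); in UTF-8 it takes two bytes (0xC2 0xA7).
  val cp1252Bytes: Array[Byte] = sectionSign.getBytes(cp1252)
  val utf8Bytes: Array[Byte] = sectionSign.getBytes(StandardCharsets.UTF_8)

  // A lone 0xA7 byte is not valid UTF-8, so decoding it as UTF-8 yields
  // the replacement character U+FFFD instead of "§".
  val misdecoded: String = new String(cp1252Bytes, StandardCharsets.UTF_8)

  // Decoding with the charset the bytes were written in recovers "§".
  val decoded: String = new String(cp1252Bytes, cp1252)

  // Render bytes as space-separated hex pairs for printing.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    println(s"Cp1252 bytes: ${hex(cp1252Bytes)}")
    println(s"UTF-8 bytes:  ${hex(utf8Bytes)}")
    println(s"Cp1252 bytes decoded as UTF-8 give the replacement char: ${misdecoded == "\uFFFD"}")
    println(s"Cp1252 bytes decoded as Cp1252 give the section sign: ${decoded == sectionSign}")
  }
}
```

Note that the literal text "U+00A7" is only a notation for the code point; inside a Scala string the character itself is written as "\u00A7".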
So, it looks like the order of the .option() calls that we pass to Auto Loader matters!
For example, the below works:
val df2 = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("encoding", "Cp1252") // ANSI
  .option("delimiter", "§")
  .option("header", "false")
  .schema(schema)
  .load(file_path)
The below does not:
val df3 = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("delimeter", "§")
  .option("encoding", "Cp1252") // ANSI
  .option("header", "false")
  .schema(schema)
  .load(file_path)
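From this comparison, it appears that the encoding has to be set before the delimiter for this to work. Note, however, that the two snippets above differ in spelling as well as in order: the working one uses "delimiter", while the failing one uses "delimeter". Spark's CSV reader takes the separator from the "sep" or "delimiter" option, so the misspelled key is not one it reads, and since the read ran without an error it was apparently just ignored. The spelling, rather than the order, may therefore be what made the difference.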
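One way to check whether option order is really the deciding factor is to try both orders with the same, correctly spelled key. The sketch below is a rough local experiment, not the Auto Loader setup itself: it assumes a local Spark installation, uses the plain batch CSV reader instead of "cloudFiles", and writes a hypothetical one-line sample file in Cp1252. The object name, the column names c1 and c2, and the sample data are all made up for illustration.

```scala
import java.nio.charset.Charset
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DelimiterOrderCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("delimiter-order-check")
      .getOrCreate()

    // Hypothetical sample: one line with two fields separated by "§" (U+00A7),
    // written as Cp1252 bytes so the encoding option is actually needed.
    val dir = Files.createTempDirectory("section-delimited")
    Files.write(dir.resolve("sample.txt"),
      "a\u00A7b\n".getBytes(Charset.forName("windows-1252")))

    val schema = StructType(Seq(
      StructField("c1", StringType),
      StructField("c2", StringType)))

    // Apply the options in exactly the order given, then read the directory as CSV.
    def read(options: Seq[(String, String)]): DataFrame =
      options
        .foldLeft(spark.read.schema(schema)) { case (reader, (k, v)) => reader.option(k, v) }
        .csv(dir.toString)

    val encoding = "encoding" -> "Cp1252"
    val delimiter = "delimiter" -> "\u00A7"
    val misspelled = "delimeter" -> "\u00A7" // the key used in the failing snippet

    println("encoding first, then delimiter:")
    read(Seq(encoding, delimiter)).show()

    println("delimiter first, then encoding:")
    read(Seq(delimiter, encoding)).show()

    println("misspelled key, then encoding:")
    read(Seq(misspelled, encoding)).show()

    spark.stop()
  }
}
```

If the first two reads split the line into two columns and only the third does not, the spelling is the cause. Whether Auto Loader behaves the same way as the batch reader here is an assumption that would need to be confirmed on Databricks.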