dataframe, scala, apache-spark, databricks

Handle double quotes while exporting a dataframe to CSV


I have a dataframe whose values contain double quotes (") and commas. I am trying to export the dataframe to CSV, but unfortunately the double quotes are not rendered properly in the exported file.

I managed to handle all the other special characters by setting "quoteAll" to true.

In CSV, if you replace the single double quote with two double quotes (""), the file is handled fine. But when I export with the code below, Spark writes \" instead of "" in the exported CSV.

%scala
import org.apache.spark.sql.functions.regexp_replace

val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")

// Double every embedded quote (" -> "") in every column
val updatedDf = df.columns.foldLeft(df)((acc, colname) =>
  acc.withColumn(colname, regexp_replace(acc(s"`$colname`"), "\"", "\"\"")))

updatedDf.coalesce(1).write
  .option("header", true)
  .option("encoding", "utf-8")
  .option("quoteAll", true)
  .mode("Overwrite").csv("[Location to store csv]")

Output:

[screenshot of the actual exported csv]

Expected Output:

[screenshot of the desired csv, with embedded quotes doubled]

How can I handle this?


Solution

  • You haven't provided the exact plaintext CSV you expect, so it's hard to guess. Are you looking for this?

    val df = Seq((1, "A,B,C,\"DEF\""), (2, "DEF")).toDF("ID", "Val")

    df.coalesce(1).write
      .option("header", true)
      .option("encoding", "utf-8")
      .option("quoteAll", true)
      .option("escape", "\"")  // escape embedded quotes with a second ", so " becomes ""
      .mode("Overwrite").csv("xxx")
    

    Resulting csv:

    "ID","Val"
    "1","A,B,C,""DEF"""
    "2","DEF"