Search code examples
csvapache-sparkdatabricksspark-csv

Spark - CSV - Write Options - Quotes


Hope everyone is doing well.

While going through the spark csv datasource options for the question I am quite confused on the difference between the various quote related options available.

spark-csv-quote-options

  1. Do we have any detailed differences between them ?
  2. Does any option override the other or they all work together ?

Used the example mentioned in the linked question to understand the differences, but still a little confused. Thank you for all the help.


Solution

  • Let's run through a few examples.

    val df = Seq(
      (1, "aaa"),
      (2, "b,c"),
      (3, "d$e"),
      (4, "f%g")
    ).toDF("x", "y")
    df.show
    
    +---+---+
    |  x|  y|
    +---+---+
    |  1|aaa|
    |  2|b,c|
    |  3|d$e|
    |  4|f%g|
    +---+---+
    

    Whenever you have commas in values that would be indistinguishable from field delimiters (like 2,b,c) Spark can quote such values. It's double quotes by default (like 2,"b,c"), but you can customize it using quote option.

    df.coalesce(1).write.mode("overwrite")
      .option("quote", "$")
      .csv("x")
    
    1,aaa
    2,$b,c$
    3,$d\$e$
    4,f%g
    

    We used dollar sign here. Not only b,c was quoted, but since d$e contained a dollar, it was quoted as well and the dollar itself was escaped.

    I don't know why you would like to do that, but you can ask Spark not to escape fields containing quote character using escapeQuotes option.

    df.coalesce(1).write.mode("overwrite")
      .option("quote", "$")
      .option("escapeQuotes", false)
      .csv("x")
    
    1,aaa
    2,$b,c$
    3,d$e
    4,f%g
    

    Here, d$e remained unquoted.

    You can also force it to quote every field using quoteAll.

    df.coalesce(1).write.mode("overwrite")
      .option("quote", "$")
      .option("quoteAll", true)
      .csv("x1")
    
    $1$,$aaa$
    $2$,$b,c$
    $3$,$d\$e$
    $4$,$f%g$
    

    Note that quoteAll = true makes escapeQuotes irrelevant.

    You can customize the escape character for the quote character inside quote using escape, if you don't like the default backslash.

    df.coalesce(1).write.mode("overwrite")
      .option("quote", "$")
      .option("escape", "%")
      .csv("x1")
    
    1,aaa
    2,$b,c$
    3,$d%$e$
    4,$f%g$
    

    Finally, note that there's a standalone escape character in f%g, so if you would like to escape that, use charToEscapeQuoteEscaping.

    df.coalesce(1).write.mode("overwrite")
      .option("quote", "$")
      .option("escape", "%")
      .option("quoteAll", true)
      .option("charToEscapeQuoteEscaping", "^")
      .csv("x1")
    
    $1$,$aaa$
    $2$,$b,c$
    $3$,$d%$e$
    $4$,$f^%g$
    

    No, you can't go deeper and escape ^...