Parsing double quotes in Databricks Spark

I have a .csv file with rows containing string values:

"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""

My second field occasionally contains 'strings with enclosed "quote" values':

"""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar"""

When read into a Spark DataFrame, this value should present as:

"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"

I have tried this, and variations of the commented-out options, as described in these docs:

df = (spark.read
    .option("header", True)
    .option("delimiter", ",")
    .option("multiLine", True)
    .option("escapeQuotes", True)
    .option("quote", "\"")
    .option("escape", "\"")

    # variations also tried:
    # .option("escape", "\\")
    # .option("escape", '""')
    # .option("escape", "\n")

    .csv("/path/to/file.csv"))


Any ideas?

I'm running on Apache Spark 3.3.0, Scala 2.12


  • OK, I think I have nailed the problem down to something to do with how Databricks and Spark process the data, rather than how it is read.

    I am reading a large 3 GB .csv file into Databricks, as per this answer and the answer by @vilabinot:

    df = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .option("multiLine", True)
        .option("escape", "\"")
        .csv("/path/to/file.csv"))

    One of the fields in this dataset is the Country code. To test whether the file read was OK, I grouped by and aggregated on this column:
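    The aggregation itself wasn't shown in the post; a minimal sketch of what I assume was run (the column name Country is an assumption):

    ```python
    from pyspark.sql import functions as F

    # Count rows per Country value; bogus codes such as "PL-10" or
    # address fragments appearing here suggest rows were split wrongly.
    country_counts = (df
        .groupBy("Country")
        .count()
        .orderBy(F.col("count").desc()))

    display(country_counts)  # Databricks display helper
    ```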


    This is where erroneous values were being returned, i.e. non-country codes appear as groups:

    Country                              count
    PL-10                                    1
    1 TO 10, AFFIL TOWER, L. H. ROAD,        1

    But if I then try to filter the df to these values:

    from pyspark.sql.functions import col

    display(df.filter(col('Country') == "PL-10"))

    Nothing is returned. So I think this is in fact something to do with how Spark or Databricks optimises different operations, rather than with the data-read step itself. A few other tests which replicated this behaviour on groupby and filter were:

    • reading from json version of the data
    • splitting the .csv into smaller files e.g. here
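    One additional check worth sketching (hypothetical, not one of the tests above): if groupBy reports "PL-10" but an equality filter finds nothing, hidden whitespace or control characters in the column are a common culprit, and a substring or trimmed comparison would expose them:

    ```python
    from pyspark.sql import functions as F

    # Show the raw value and its length; a length longer than the
    # visible text points to hidden padding characters.
    display(df
        .filter(F.col("Country").contains("PL-10"))
        .select("Country", F.length("Country").alias("len")))

    # A trimmed comparison matches if the mismatch is only padding.
    display(df.filter(F.trim(F.col("Country")) == "PL-10"))
    ```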

    For reference, the dataset is the GLEIF LEI index.