Tags: csv, apache-spark, databricks, double-quotes

Parsing double quotes in Databricks Spark


I have a .csv file with rows containing string values:

"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""

My second field occasionally contains strings with embedded "quote" values, where the quotes are escaped by doubling them:

"""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar"""

When read into a Spark DataFrame, I would like the value to be presented as:

"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"

I have tried this, along with variations of the commented-out options described in these docs:

df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("delimiter", ",")
    .option("multiline", True)
    .option("escapeQuotes", True)
    
    .option("quote", "\"")  
    .option("escape", "\"")

    # .option("escape", "\\")
    # .option("escape", '""') 
    # .option("escape", "\n")

    .schema(raw_schema)
    .csv(landing_schema_file)
     )

Any ideas?

I'm running on Apache Spark 3.3.0, Scala 2.12


Solution

  • OK, I think I have nailed the problem down to something in how Databricks and Spark process the data, rather than in how they read it.

    I am reading a large 3 GB .csv file into Databricks using the following, as per this answer and the answer by @vilabinot:

    df = (spark
        .read
        .format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .option("multiline", True)
        .option("escape", "\"")
        .csv(file_path_or_directory)
         )
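
    Setting escape to the quote character is what makes the parser treat a doubled quote inside a quoted field as a literal quote. As a quick sanity check of the read itself, the sample row from the question can be passed in as an in-memory RDD of strings (spark.read.csv accepts an RDD of CSV rows as well as a path):

    sample = ['"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""']
    check = (spark
        .read
        .option("quote", "\"")
        .option("escape", "\"")
        .csv(spark.sparkContext.parallelize(sample))
        )
    # the second column comes back as: "LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"
    display(check)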
    

    One of the fields in this dataset is a country code (Country). To test whether the file read was OK, group by and aggregate on it:

    from pyspark.sql.functions import count
    display(df.groupby("Country").agg(count("*").alias("count")))
    

    This is where erroneous values were returned, i.e. non-country codes appear as groups:

    Country                           | count
    PL-10                             | 1
    1 TO 10, AFFIL TOWER, L. H. ROAD, | 1

    But if I then try to filter the df to these values:

    from pyspark.sql.functions import col
    display(df.filter(col('Country') == "PL-10"))
    

    Nothing is returned. So I think this is in fact something to do with the way Spark or Databricks tries to optimise different functions, rather than with the data read step itself. A few other tests I tried which replicate this behaviour on groupby and filter were:

    • reading from a JSON version of the data (see the sketch after this list)
    • splitting the .csv into smaller files, e.g. here
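
    For the JSON test, this is roughly what I ran (the path variable and the presence of a top-level Country field are assumptions about the JSON export; the spurious groups appeared there in the same way):

    from pyspark.sql.functions import count

    # json_file_path is a placeholder for wherever the JSON export lives.
    df_json = (spark
        .read
        .option("multiline", True)
        .json(json_file_path)
        )
    display(df_json.groupby("Country").agg(count("*").alias("count")))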

    For reference, the dataset is the GLEIF LEI index.