Tags: csv, apache-spark, databricks, double-quotes

Parsing double quotes in Databricks Spark


I have a .csv file with rows containing string values:

"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""

My second field occasionally contains strings with embedded "quote" values, where the quotes are escaped by doubling them:

"""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar"""

When read into a Spark DataFrame, I would like the value to be presented as:

"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"

I have tried this, along with variations of the commented-out options described in these docs:

df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("delimiter", ",")
    .option("multiline", True)
    .option("escapeQuotes", True)
    
    .option("quote", "\"")  
    .option("escape", "\"")

    # .option("escape", "\\")
    # .option("escape", '""') 
    # .option("escape", "\n")

    .schema(raw_schema)
    .csv(landing_schema_file)
     )

Any ideas?

I'm running on Apache Spark 3.3.0, Scala 2.12


Solution

  • OK, I think I have nailed the problem down to something in how Databricks and Spark process the data, rather than in how they read it.

    I am reading a large 3 GB .csv file into Databricks using the following, as per this answer and the answer by @vilabinot:

    df = (spark
        .read
        .format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .option("multiline", True)
        .option("escape", "\"")
        .csv(file_path_or_directory)
         )
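
    Setting escape to the quote character is what makes the parser treat a doubled quote inside a quoted field as a literal quote. As a quick sanity check of the read itself, the sample row from the question can be passed in as an in-memory RDD of strings (spark.read.csv accepts an RDD of CSV rows as well as a path):

    sample = ['"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""']
    check = (spark
        .read
        .option("quote", "\"")
        .option("escape", "\"")
        .csv(spark.sparkContext.parallelize(sample))
        )
    # the second column comes back as: "LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"
    display(check)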
    

    One of the fields in this dataset is a country code (Country). To test whether the file read was OK, group by and aggregate on it:

    from pyspark.sql.functions import count
    display(df.groupby("Country").agg(count("*").alias("count")))
    

    This is where erroneous values were returned, i.e. non-country codes appear as groups:

    Country                           | count
    PL-10                             | 1
    1 TO 10, AFFIL TOWER, L. H. ROAD, | 1

    But if I then try to filter the df to these values:

    from pyspark.sql.functions import col
    display(df.filter(col('Country') == "PL-10"))
    

    Nothing is returned. So I think this is in fact something to do with the way Spark or Databricks tries to optimise different functions, rather than with the data read step itself. A few other tests I tried which replicate this behaviour on groupby and filter were:

    • reading from a JSON version of the data (see the sketch after this list)
    • splitting the .csv into smaller files, e.g. here
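
    For the JSON test, this is roughly what I ran (the path variable and the presence of a top-level Country field are assumptions about the JSON export; the spurious groups appeared there in the same way):

    from pyspark.sql.functions import count

    # json_file_path is a placeholder for wherever the JSON export lives.
    df_json = (spark
        .read
        .option("multiline", True)
        .json(json_file_path)
        )
    display(df_json.groupby("Country").agg(count("*").alias("count")))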

    For reference, the dataset is the GLEIF LEI index.