I have a .csv file with rows containing string values:
"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""
My second field occasionally contains 'strings with enclosed "quote" values':
"""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar"""
which, when read into a Spark DataFrame, should present as:
"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"
I have tried this, and variations of the commented-out options, as described in these docs:
df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("delimiter", ",")
    .option("multiLine", True)
    .option("escapeQuotes", True)
    .option("quote", "\"")
    .option("escape", "\"")
    # .option("escape", "\\")
    # .option("escape", '""')
    # .option("escape", "\n")
    .schema(raw_schema)
    .csv(landing_schema_file)
)
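For reference, a minimal repro with just the sample row (a sketch; the header and column names here are invented for illustration) using quote and escape both set to `"`:

# Write the sample row to a driver-local file and read it back with the
# doubled-quote convention: quote and escape both set to '"'.
sample = (
    '"id","name","lang","x","y"\n'
    '"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""\n'
)
with open("/tmp/sample.csv", "w", encoding="utf-8") as f:
    f.write(sample)

test_df = (spark.read
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .csv("file:/tmp/sample.csv"))  # file: prefix so the driver-local path is read, not DBFS
test_df.select("name").show(truncate=False)
# Expected: "LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"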
Any ideas?
I'm running on Apache Spark 3.3.0, Scala 2.12
Update: I think I have narrowed the problem down to how Databricks and Spark process the data, rather than the read itself.
I am reading a large (3 GB) .csv file into Databricks, as per this answer and the answer by @vilabinot:
df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .option("multiLine", True)
    .option("escape", "\"")
    .csv(file_path_or_directory)
)
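As a rough sanity check on the read itself, one can compare parsed records against raw physical lines (a sketch: with multiLine records the two numbers will not match exactly, but an order-of-magnitude gap points at a parsing problem):

# Raw physical lines vs. parsed records; quoted multi-line fields mean
# raw_lines > records + 1 (header), but a wildly different number
# suggests rows are being split or merged incorrectly.
records = df.count()
raw_lines = spark.read.text(file_path_or_directory).count()
print(records, raw_lines)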
A second check: one of the fields in this dataset is a country code. To test whether the file was read correctly, group by and aggregate on it:
from pyspark.sql.functions import col, count

df.groupBy("Country").agg(count("*").alias("count"))
This is where erroneous values were returned, i.e. non-country codes appear as groups:
| Country | count |
|---|---|
| PL-10 | 1 |
| 1 TO 10, AFFIL TOWER, L. H. ROAD, | 1 |
But if I then try to filter the df to one of these values:
display(df.filter(col('Country') == "PL-10"))
Nothing is returned. So I think this is in fact something to do with how Spark or Databricks optimises these operations, rather than the data read step itself; a few other tests combining groupby and filter reproduced the same behaviour.
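One diagnostic worth running on a value like PL-10 that shows up in a groupBy but not in an equality filter is to look for hidden whitespace or control characters (a sketch, using the same column name as above):

from pyspark.sql.functions import col, length, trim

# Substring-match instead of exact equality, and compare raw vs. trimmed
# lengths; a mismatch means the stored value carries hidden characters
# (trailing newline, non-breaking space, etc.).
display(df
    .filter(col("Country").contains("PL-10"))
    .select("Country",
            length(col("Country")).alias("len"),
            length(trim(col("Country"))).alias("trimmed_len")))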
For reference, the dataset is the GLEIF LEI index.