I'm trying to load a CSV with PySpark from a partitioned folder: mnt/data/test/ingestdatetime=20210208/test_20210208.csv
df = spark.read.csv("mnt/data/test")
df = df.filter(df['ingestdatetime'] == '20210208')
Basically, I want to check whether the schema is different from what it is supposed to be (the data does not come with headers, so I cannot compare headers).
The issue is that when I load the data at the top level ("mnt/data/test/"), the schema is inferred from only a few rows, so it does not detect whether the new file has extra or missing columns. As a result, I cannot tell whether the schemas differ.
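For context, the check I have in mind is roughly the following (expected_schema is only a placeholder for the schema I would define up front):

from pyspark.sql.types import StructType, StructField, StringType

# placeholder for the schema I actually expect (the files have no header,
# so the columns come back as _c0, _c1, ...)
expected_schema = StructType([
    StructField("_c0", StringType()),
    StructField("_c1", StringType()),
    StructField("_c2", StringType()),
])

df = spark.read.csv("mnt/data/test")
df = df.filter(df['ingestdatetime'] == '20210208')

# ingestdatetime is a discovered partition column, so ignore it here
data_cols = [c for c in df.columns if c != 'ingestdatetime']

# this is the comparison I would like to make, but because the column layout
# is taken from a sample of the files, it does not flag a single bad file
if len(data_cols) != len(expected_schema.fields):
    raise ValueError("unexpected number of columns")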
The first way I could do this is to load the data directly from the partition (mnt/data/test/ingestdatetime=20210208/), but then I would lose the partition key column.
I guess I could also load everything as strings.
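Something like this, as a rough sketch of that second idea:

# read the single partition directly; without inferSchema every column is a
# string (_c0, _c1, ...), so at least the column count of this one file can
# be compared against what is expected
df_raw = spark.read.option("inferSchema", "false").csv("mnt/data/test/ingestdatetime=20210208")
print(len(df_raw.columns), df_raw.columns)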
You can use the basePath option when reading with PySpark to keep the partition column in the output DataFrame. This option is well known but barely documented (or rather, documented only for Parquet, although it works with the other file sources as well):
spark.read.option("basePath", "/mnt/data/test").csv("/mnt/data/test/ingestdatetime=20210208")
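The partition column then shows up as a regular column even though only one partition directory was read, so you can still filter on it or include it in a schema check, for example:

df = (spark.read
      .option("basePath", "/mnt/data/test")
      .csv("/mnt/data/test/ingestdatetime=20210208"))

df.printSchema()  # ingestdatetime is part of the schema thanks to basePath

# compare only the data columns against whatever you expect for this feed
data_cols = [c for c in df.columns if c != "ingestdatetime"]
print(len(data_cols))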