Tags: pyspark, schema, databricks, partitioning

PySpark load CSV does not show the real schema of a new file (only the "inferred" schema)


I'm trying to load a CSV with PySpark from a partitioned folder: mnt/data/test/ingestdatetime=20210208/test_20210208.csv

    # Read the whole top-level folder, then filter on the partition column
    df = spark.read.csv("mnt/data/test")
    df = df.filter(df['ingestdatetime'] == '20210208')

Basically I want to check whether the schema differs from what it is supposed to be (the data does not come with headers, so I cannot compare headers).

The issue is that whenever I load the data at the top level ("data/test/"), the schema is inferred from only a few rows, so it does not detect whether a new file has extra or missing columns. As a result, I cannot tell whether the schemas differ.
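For reference, here is a minimal sketch of the kind of check intended, assuming a hypothetical 7-column, all-string expected layout (the column names and count are illustrative, not from the original question). As described above, it only catches a mismatch when the read actually reflects the new file's columns rather than a schema inferred over the whole folder:

    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical expected layout: 7 string columns (the files have no headers,
    # so Spark names them _c0, _c1, ...)
    expected_schema = StructType(
        [StructField(f"_c{i}", StringType(), True) for i in range(7)]
    )

    df = spark.read.csv("mnt/data/test")

    # Compare only the column count / names, since all columns are strings here
    if len(df.columns) != len(expected_schema.fields):
        raise ValueError(
            f"Expected {len(expected_schema.fields)} columns, got {len(df.columns)}"
        )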

I see this (6 columns): [screenshot of the inferred schema]

Instead of this (7 columns): [screenshot of the expected schema]

The first way I could do it is to load the data directly from the partition (data/test/ingestdatetime=20210208/), but then I would lose the partition key column.

I guess I could also load everything as strings.
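A sketch of that workaround, under the assumption that the partition value is re-added by hand with lit() (the column name and value are taken from the question; without inferSchema, CSV columns come back as strings by default):

    from pyspark.sql.functions import lit

    # Read one partition directory directly; every column is a string
    part = spark.read.csv("mnt/data/test/ingestdatetime=20210208")

    # Partition discovery does not run on a leaf directory, so the partition
    # column is missing and has to be added back manually
    part = part.withColumn("ingestdatetime", lit("20210208"))

    # The column count now reflects this partition's files alone
    print(len(part.columns), part.columns)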


Solution

  • You can use the basePath option when reading with PySpark to "keep" the partition column in the output DataFrame. The option is well known but sparsely documented (it is described only for Parquet, though it applies to the other file-based sources as well).

    # basePath tells Spark where partition discovery starts, so the
    # ingestdatetime partition column is still added to the DataFrame
    spark.read.option("basePath", "/mnt/data/test").csv("/mnt/data/test/ingestdatetime=20210208")
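As a follow-up, a short usage sketch of how the result could then be inspected (the expectation of 7 data columns is carried over from the question and is illustrative):

    df = (spark.read
          .option("basePath", "/mnt/data/test")
          .csv("/mnt/data/test/ingestdatetime=20210208"))

    # The ingestdatetime partition column is kept (with a type inferred by
    # partition discovery), while the data columns reflect only this
    # partition's files, so the schema can now be compared to the expected one
    df.printSchema()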