I saved my DataFrame in Parquet format:
df.write.parquet('/my/path')
When checking on HDFS, I can see that there are 10 part-xxx.snappy.parquet files under the Parquet directory /my/path.
My question is: does each part-xxx.snappy.parquet file correspond to one partition of my DataFrame?
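Yes. The part-* files are created based on the number of partitions in the DataFrame at the time it is written to HDFS: each partition is written out by its own task, so 10 part files generally means the DataFrame had 10 partitions.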
To check the number of partitions in the DataFrame:
df.rdd.getNumPartitions()
To control the number of files written to the filesystem, we can call .repartition() or .coalesce() before the write, with a partition count that is either fixed or computed dynamically based on our requirements.