apache-spark pyspark

Reading data from CSV in Spark


Thank you for making time to answer this question.

I was recently working with Spark and I read that one HDFS block corresponds to one Spark partition. By that logic, there are many cases where HDFS is not the source at all. So, if we read data from a CSV or any other file-based format, how is that data partitioned, given that there is no explicit partitioning?


Solution

  • When you read a CSV file with Spark, the partitioning is governed by the configuration `spark.sql.files.maxPartitionBytes`, which according to [the Spark documentation][1] defaults to 134217728 bytes (128 MB).

    So, for example, if you set `spark.sql.files.maxPartitionBytes` to `1024` and read a CSV file of about 1 MB, you will end up with on the order of 1,000 partitions (roughly 1,048,576 bytes ÷ 1,024 bytes per split).

[1]: https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
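
Here is a minimal PySpark sketch of that behavior; the app name and the `data.csv` path are illustrative assumptions, not from the original answer:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-demo")  # hypothetical app name
    # Cap each input partition at 1 KB instead of the 128 MB default.
    .config("spark.sql.files.maxPartitionBytes", "1024")
    .getOrCreate()
)

# "data.csv" is a placeholder; substitute any ~1 MB CSV file.
df = spark.read.option("header", True).csv("data.csv")

# For a ~1 MB file this prints on the order of 1,000 partitions.
print(df.rdd.getNumPartitions())
```

Note that the exact count can differ slightly from the back-of-the-envelope division, because Spark also factors in `spark.sql.files.openCostInBytes` and the session's default parallelism when packing file splits into partitions.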