apache-spark pyspark

Reading data from CSV in Spark


Thank you for making time to answer this question.

I was recently working with Spark and I read that one HDFS block corresponds to one Spark partition. By that logic, there are many cases where HDFS is not the source at all. So, if we read data from a CSV or any other file-based format, how is that data partitioned, given that there is no explicit partitioning?


Solution

  • When you read a CSV file with Spark, the partitioning is governed by the configuration `spark.sql.files.maxPartitionBytes`, which according to [the Spark documentation][1] defaults to 134217728 bytes (128 MB).

    So, for example, if you set `spark.sql.files.maxPartitionBytes` to `1024` and read a CSV file of about 1 MB, you will end up with on the order of 1,000 partitions (roughly 1,048,576 bytes ÷ 1,024 bytes per split).

[1]: https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
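
Here is a minimal PySpark sketch of that behavior; the app name and the `data.csv` path are illustrative assumptions, not from the original answer:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-demo")  # hypothetical app name
    # Cap each input partition at 1 KB instead of the 128 MB default.
    .config("spark.sql.files.maxPartitionBytes", "1024")
    .getOrCreate()
)

# "data.csv" is a placeholder; substitute any ~1 MB CSV file.
df = spark.read.option("header", True).csv("data.csv")

# For a ~1 MB file this prints on the order of 1,000 partitions.
print(df.rdd.getNumPartitions())
```

Note that the exact count can differ slightly from the back-of-the-envelope division, because Spark also factors in `spark.sql.files.openCostInBytes` and the session's default parallelism when packing file splits into partitions.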