Tags: apache-spark, hadoop, apache-spark-sql, hdfs, partitioning

Does dataFrameWriter partitionBy shuffle the data?


I have data partitioned one way, and I just want to partition it another way. So it will basically be something like this:

sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")

I wonder whether this will trigger a shuffle, or whether all the data will be re-partitioned locally. In this context a partition is just a directory in HDFS, so data from the same partition doesn't have to be on the same node to be written into the same directory.


Solution

  • Neither partitionBy nor bucketBy shuffles the data. There are cases, though, when repartitioning the data first is a good idea:

    df.repartition(...).write.partitionBy(...)
    

    Otherwise the number of output files is bounded by the number of Spark partitions multiplied by the cardinality of the partitioning column: in the worst case, every task holds rows for every distinct value and writes a separate file into each directory.
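
    To see why that bound matters, here is a back-of-the-envelope sketch. The numbers (200 Spark partitions, which happens to be Spark's default shuffle parallelism, and a hypothetical partitioning column with 50 distinct values) are made up for illustration, not taken from the question:

    ```python
    # Hypothetical figures, for illustration only.
    num_spark_partitions = 200          # in-memory partitions of the DataFrame
    partition_column_cardinality = 50   # distinct values of the partitionBy column

    # Worst case without a prior repartition: every Spark partition contains
    # rows for every distinct value, so each task writes one file per value.
    max_files_without_repartition = num_spark_partitions * partition_column_cardinality
    print(max_files_without_repartition)  # 10000

    # After df.repartition(col), rows sharing a value are collocated in one
    # Spark partition, so each output directory receives roughly one file.
    files_after_repartition = partition_column_cardinality
    print(files_after_repartition)  # 50
    ```

    That is why `df.repartition(col).write.partitionBy(col)` pays the cost of one shuffle up front but can cut the output from thousands of small files to roughly one file per directory.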