Tags: apache-spark, apache-spark-sql, datastax, parquet

Spark repartition is not working as expected


I am using spark-sql 2.3.1, and I set

spark.sql.shuffle.partitions=40 

in my code:

val partitioned_df = vals_df.repartition(col("model_id"), col("fiscal_year"), col("fiscal_quarter"))

When I run

println(" Number of partitions : " + partitioned_df.rdd.getNumPartitions)

it gives 40 as output. In fact, after the repartition the count should ideally be around 400. Why is repartition not working here? What am I doing wrong, and how can I fix it?


Solution

  • set spark.sql.shuffle.partitions=40

    This is the default number of shuffle partitions. My understanding is that it applies to JOINs and aggregations, and repartition by column alone also falls back to it when no explicit partition count is given, which is why you see 40 instead of ~400. (A runnable sketch of this fallback follows at the end of this answer.)

    Try something like this - my own example:

    val df2 = df.repartition(40, $"c1", $"c2")
    

    Here is the output of

    df.repartition(40, $"c1", $"c2").explain
    
    == Physical Plan ==
    Exchange hashpartitioning(c1#114, c2#115, 40)
    ...
    

    You can also set the number of partitions dynamically; see the sizing sketch at the end of this answer:

    val n = 400 // placeholder; replace with your own calculation
    val df2 = df.repartition(n, $"c1", $"c2")
    df2.explain
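
To see the fallback in action, here is a minimal, self-contained sketch (my own example; the local session and sample rows are made up, reusing the question's column names):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("repartition-demo")
  .config("spark.sql.shuffle.partitions", "40")
  .getOrCreate()
import spark.implicits._

// Made-up sample data with the question's column names
val vals_df = Seq(
  (1, 2019, "Q1"),
  (2, 2019, "Q2"),
  (3, 2020, "Q1")
).toDF("model_id", "fiscal_year", "fiscal_quarter")

// No explicit count: falls back to spark.sql.shuffle.partitions => 40
val byColsOnly = vals_df.repartition(col("model_id"), col("fiscal_year"), col("fiscal_quarter"))
println(byColsOnly.rdd.getNumPartitions) // prints 40

// Explicit count: overrides the default => 400
val explicitCount = vals_df.repartition(400, col("model_id"), col("fiscal_year"), col("fiscal_quarter"))
println(explicitCount.rdd.getNumPartitions) // prints 400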
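
The calculation of n was left open above. One possible approach (my suggestion only, assuming an extra pass over the data is acceptable) is to size n from the number of distinct key combinations, capped at some maximum:

// Hypothetical sizing: one partition per distinct (c1, c2) combination,
// capped to avoid creating an excessive number of tiny partitions
val distinctKeys = df.select($"c1", $"c2").distinct.count // returns a Long
val n = math.min(distinctKeys, 400L).toInt
val df2 = df.repartition(n, $"c1", $"c2")
df2.explain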