Tags: apache-spark, apache-spark-sql, datastax, parquet

Spark repartition is not working as expected


I am using spark-sql 2.3.1, and I set

spark.sql.shuffle.partitions=40 

in my code:

val partitioned_df = vals_df.repartition(col("model_id"), col("fiscal_year"), col("fiscal_quarter"))

When I run

println(" Number of partitions : " + partitioned_df.rdd.getNumPartitions)

it gives 40 as output. In fact, after the repartition the count should ideally be around 400. Why is repartition not working here? What am I doing wrong, and how can I fix it?


Solution

  • set spark.sql.shuffle.partitions=40

    This is the default number of shuffle partitions. My understanding is that it applies to JOINs and aggregations, and repartition by column alone also falls back to it when no explicit partition count is given, which is why you see 40 instead of ~400. (A runnable sketch of this fallback follows at the end of this answer.)

    Try something like this - my own example:

    val df2 = df.repartition(40, $"c1", $"c2")
    

    Here is the output of

    df.repartition(40, $"c1", $"c2").explain
    
    == Physical Plan ==
    Exchange hashpartitioning(c1#114, c2#115, 40)
    ...
    

    You can also set the number of partitions dynamically; see the sizing sketch at the end of this answer:

    val n = 400 // placeholder; replace with your own calculation
    val df2 = df.repartition(n, $"c1", $"c2")
    df2.explain
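
To see the fallback in action, here is a minimal, self-contained sketch (my own example; the local session and sample rows are made up, reusing the question's column names):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("repartition-demo")
  .config("spark.sql.shuffle.partitions", "40")
  .getOrCreate()
import spark.implicits._

// Made-up sample data with the question's column names
val vals_df = Seq(
  (1, 2019, "Q1"),
  (2, 2019, "Q2"),
  (3, 2020, "Q1")
).toDF("model_id", "fiscal_year", "fiscal_quarter")

// No explicit count: falls back to spark.sql.shuffle.partitions => 40
val byColsOnly = vals_df.repartition(col("model_id"), col("fiscal_year"), col("fiscal_quarter"))
println(byColsOnly.rdd.getNumPartitions) // prints 40

// Explicit count: overrides the default => 400
val explicitCount = vals_df.repartition(400, col("model_id"), col("fiscal_year"), col("fiscal_quarter"))
println(explicitCount.rdd.getNumPartitions) // prints 400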
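
The calculation of n was left open above. One possible approach (my suggestion only, assuming an extra pass over the data is acceptable) is to size n from the number of distinct key combinations, capped at some maximum:

// Hypothetical sizing: one partition per distinct (c1, c2) combination,
// capped to avoid creating an excessive number of tiny partitions
val distinctKeys = df.select($"c1", $"c2").distinct.count // returns a Long
val n = math.min(distinctKeys, 400L).toInt
val df2 = df.repartition(n, $"c1", $"c2")
df2.explain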