Tags: apache-spark, dataframe, pyspark, rdd

Does a Spark DataFrame need to be repartitioned after a filter, like an RDD?


According to many good resources, it is advisable to repartition an RDD after a filter operation, since there is a possibility that most of the partitions are now empty. Has this been handled for DataFrames in current versions, or do we still need to repartition after a filter operation?


Solution

  • Has this been handled for DataFrames in current versions, or do we still need to repartition after a filter operation?

    If you are asking whether Spark automatically repartitions data after a filter, the answer is no (and I hope it won't change in the future).
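
    As a quick check, here is a minimal sketch (the session setup, data and numbers are purely illustrative) showing that the partition count is unchanged by a filter:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("filter-partitions").getOrCreate()

        # 200 partitions of synthetic data; the filter keeps roughly 0.1% of the rows.
        df = spark.range(0, 1_000_000, numPartitions=200)
        filtered = df.filter(df["id"] % 1000 == 0)

        # filter() is a narrow transformation, so both lines print the same count (200 here).
        print(df.rdd.getNumPartitions())
        print(filtered.rdd.getNumPartitions())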

    According to many good resources, it is advisable to repartition an RDD after a filter operation, since there is a possibility that most of the partitions are now empty.

    This really depends on two factors:

    • How selective the filter is (i.e., what fraction of the records is expected to be preserved).
    • How the data is distributed with respect to the predicate prior to the filter (see the sketch just after this list).
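
    One rough way to see the second factor in practice, sketched here with a deliberately skewed setup (repartitionByRange and the range predicate are chosen so the data is clustered with respect to the predicate), is to count surviving records per partition:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Rows are range-partitioned by id, so a range predicate on id
        # will leave most of the 100 partitions completely empty.
        df = spark.range(0, 1_000_000).repartitionByRange(100, "id")
        filtered = df.filter(F.col("id") < 10_000)

        # Record count per partition; most entries are expected to be 0.
        sizes = filtered.rdd.glom().map(len).collect()
        print(sum(1 for s in sizes if s == 0), "empty partitions out of", len(sizes))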

    Unless you expect the predicate to prune the majority of the data, or the prior distribution to leave a significant fraction of partitions empty, the cost of repartitioning usually outweighs the potential benefits, so the main reason to call repartition is to limit the number of output files.
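
    If limiting output files is the goal, a typical pattern (sketched here; the source data and output path are illustrative) is to coalesce right before the write:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        df = spark.range(0, 1_000_000, numPartitions=200)
        filtered = df.filter(F.col("id") % 1000 == 0)  # highly selective filter

        # coalesce() merges existing partitions without a full shuffle, so the job
        # writes at most 8 files instead of up to 200 mostly-empty ones.
        # Use repartition(8) instead if the remaining data should also be rebalanced evenly.
        filtered.coalesce(8).write.mode("overwrite").parquet("/tmp/filtered_output")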