According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty. I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?
I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?
If you ask if Spark automatically repartitions data the answer is negative (and I hope it won't change in the future)
According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty.
This really depends on two factors:
Unless you expect that predicate prunes majority of data or prior distribution will leave significant fraction of partitions empty, costs of repartitioning usually outweigh potential benefits, so the main reason to call repartition
is to limit the number of the output files.