When performing the distinct operation in Spark,
- Initially each partition computes distinct values based on hashing.
- These distinct values are then passed to the driver or another executor for a final computation of distinct values across all partitions.
Question: Where does the second level of distinct computation occur? Does it happen at the executor level or directly at the driver?
Sheer logic should tell you the answer (for a dataframe):
Imagine N partitions with a column x. The steps are:
- For each partition, do a local aggregation for uniqueness.
- Shuffle by hashing so that global uniqueness can be applied, i.e. all rows with value A for col x land in the same partition.
- For each partition, perform the local aggregation again; because of the shuffle, the result is now global.
There is no need for a final aggregation on the Driver or on a single Executor, unlike, say, an order by without a grouping column.
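The three steps above can be sketched in plain Python (no Spark required). This is only an illustration of the logic, not Spark's actual implementation: the partition lists, `local_distinct`, and `shuffle_by_hash` are hypothetical stand-ins for what the engine does internally.

```python
def local_distinct(partition):
    # Step 1: each partition removes its own duplicates locally.
    return set(partition)

def shuffle_by_hash(partitions, num_partitions):
    # Step 2: hash-partition values so equal values land in the same partition.
    shuffled = [set() for _ in range(num_partitions)]
    for part in partitions:
        for value in part:
            shuffled[hash(value) % num_partitions].add(value)
    return shuffled

def distinct(partitions):
    locally_deduped = [local_distinct(p) for p in partitions]
    shuffled = shuffle_by_hash(locally_deduped, len(partitions))
    # Step 3: local distinct again; after the shuffle it is effectively global.
    return [sorted(local_distinct(p)) for p in shuffled]

# Example: 3 partitions of col x with overlapping values.
parts = [["A", "B", "A"], ["B", "C"], ["A", "C", "C"]]
result = distinct(parts)
print(sorted(v for p in result for v in p))
```

Note that every step runs on the (simulated) executors; nothing is collected to a single place, which is the point of the answer.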