apache-spark, pyspark

distinct on data from multiple executors


When performing the distinct operation in Spark,

  1. Initially, each partition computes distinct values based on hashing.
  2. These distinct values are then passed to the driver or another executor for a final computation of distinct values across all partitions.

Question: Where does the second level of distinct computation occur? Does it happen at the executor level or directly at the driver?
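For reference, a minimal sketch of the operation in question (the data, column name, and local master setting are illustrative only):

    from pyspark.sql import SparkSession

    # Illustrative setup: a small DataFrame spread over several partitions.
    spark = SparkSession.builder.master("local[4]").getOrCreate()

    df = spark.createDataFrame(
        [("A",), ("B",), ("A",), ("C",), ("B",)], ["x"]
    ).repartition(4)

    # The operation under discussion: a global distinct across all partitions.
    distinct_df = df.select("x").distinct()
    distinct_df.show()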


Solution

  • Sheer logic should tell you the answer (for a DataFrame):

    Imagine N partitions with a column x. The steps are:

    1. For each partition, do a local aggregation for uniqueness.
    2. Shuffle via hashing so as to apply global uniqueness, i.e. get all the A's for col x into the same partition.
    3. For each partition, perform the local aggregation again - this time the result is global.

    There is no need for a local aggregation on the Driver, nor for funnelling everything through a single Executor as happens for an order by without a grouping column. You can confirm this from the physical plan, sketched below.
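Continuing the earlier sketch, inspecting the plan with explain() shows both aggregation levels running on executors, with only a shuffle in between. The exact operator names and the shuffle partition count vary by Spark version and configuration, but a plan of roughly this shape is typical:

    # Inspect how Spark plans the distinct.
    distinct_df.explain()

    # Typical (abridged) output -- details vary by Spark version:
    #   HashAggregate(keys=[x], functions=[])        <- final distinct, per partition, on executors
    #   +- Exchange hashpartitioning(x, 200)         <- shuffle rows by hash of x
    #      +- HashAggregate(keys=[x], functions=[])  <- partial distinct, per partition, on executors
    #         +- ...

Nothing in the plan routes data through the driver; the "second level" of distinct is simply another HashAggregate executed on whichever executors own the post-shuffle partitions.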