apache-spark, pyspark

Does collect() pull the dataframe to the driver before performing a calculation?


I understand that collect() pulls the dataframe to the driver node, but if I were to run the following code, would Spark calculate the max date on the worker nodes and then collect only the result to the driver, or would it collect the whole dataframe to the driver and then calculate the max date?

df.select(max("date")).collect()[0][0]

My cluster has crashed a few times because of the above code. I thought that the max date would be calculated on the workers before collecting the result, but given that the cluster is crashing, I'm wondering if it is actually the other way around.


Solution

  • max would be performed on each worker as a map-side partial aggregation (it appears as partial_max in the physical plan). The per-partition maxima are then shuffled to a single partition, which computes the max of those maxima (the reduce step).

    The aggregation therefore runs in distributed fashion; collect is not needed for that to happen, and it only fetches the single-row result to the driver. See Best way to get the max value in a Spark dataframe column. Also see https://medium.com/geekculture/finding-the-latest-date-is-not-as-easy-as-you-would-think-2d6a0a49eda1
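The two-phase aggregation described above can be sketched in plain Python (not PySpark; the partition layout and dates are made up purely for illustration):

```python
# Minimal simulation of Spark's two-phase max aggregation:
# a partial max per partition (map side), then a max of those
# partial results in a single reduce step. ISO date strings
# compare correctly as plain strings.
partitions = [
    ["2021-01-03", "2021-02-14"],   # partition 0
    ["2021-03-01", "2020-12-25"],   # partition 1
    ["2021-02-28"],                 # partition 2
]

# Map side: each worker computes the max of its own partition only.
partial_maxes = [max(p) for p in partitions]

# Reduce side: only the partial results (3 values, not the whole
# dataset) are shuffled to one partition and reduced to the final max.
overall_max = max(partial_maxes)
print(overall_max)  # 2021-03-01
```

Only the handful of partial results ever travels across the network, which is why the max itself should not overwhelm the driver.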