apache-spark pyspark apache-spark-sql drop-duplicates

Applying PySpark dropDuplicates method messes up the sorting of the data frame


I'm not sure why this happens, but when I apply dropDuplicates to a sorted data frame, the sort order is lost. Compare the following two tables.

The following table is the output of sorted_df.show(), where the rows are ordered by sorted_col.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+

The following table is the output of sorted_df.dropDuplicates().show(). The rows are no longer sorted, even though it's the same data frame.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+

Can someone explain why this happens and how I can keep the same sort order with dropDuplicates applied?

Apache Spark version 3.1.2


Solution

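  • dropDuplicates involves a shuffle: rows are redistributed across partitions so that identical rows end up together, and the result comes back in whatever order those partitions produce. Any ordering applied beforehand is therefore not preserved. To get sorted output, apply the sort after dropDuplicates rather than before it.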
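
Since the shuffle discards the earlier ordering, one way around it is to make the sort the last step. The following is a minimal sketch, not a definitive implementation: the helper name dedupe_then_sort and the demo data are made up for illustration, and the column names are taken from the tables in the question. Sorting on another_col as a tiebreaker is an assumption added here so that the rows sharing sorted_col = 91 come out in a repeatable order.

```python
def dedupe_then_sort(df, *sort_cols):
    """Drop duplicate rows, then sort, so the sort is the final step.

    `df` is expected to be a pyspark.sql.DataFrame; only its
    dropDuplicates() and orderBy() methods are used.
    """
    # dropDuplicates() shuffles rows between partitions, so any earlier
    # ordering is lost; orderBy() afterwards re-establishes it.
    return df.dropDuplicates().orderBy(*sort_cols)


def demo():
    """Hypothetical demo; needs a local PySpark installation."""
    # Imported here so the helper above has no import-time dependency.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    # A few rows in the shape of the question's data, with one duplicate.
    rows = [(91, 9), (1, 1), (91, 7), (8, 5), (1, 1), (91, 1)]
    df = spark.createDataFrame(rows, ["sorted_col", "another_col"])
    # Sort on sorted_col, using another_col to break ties.
    dedupe_then_sort(df, "sorted_col", "another_col").show()
    spark.stop()
```

With the data frame from the question, the equivalent call would be dedupe_then_sort(sorted_df, "sorted_col", "another_col").show(). Sorting on sorted_col alone also works, but rows with equal sorted_col values may then appear in any order relative to each other.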