I have a dataframe with about 20 columns and 15M rows that I have to sort based on some conditions. I would also prefer not to add new columns to the dataframe just to control the ordering.
For simplicity, let's say I have the following data, where A is an integer and a1, a2 each have the value 0 or 1:
| A| a1| a2|
|:---:|:---:|:---:|
| 3| 0| 0|
| 1| 0| 1|
| 2| 0| 1|
| 1| 1| 0|
| 3| 1| 1|
| 1| 1| 1|
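For reference, the sample data can be built with something like this (a minimal setup assuming a local Spark session, just for illustration) -

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above
df = spark.createDataFrame(
    [(3, 0, 0), (1, 0, 1), (2, 0, 1), (1, 1, 0), (3, 1, 1), (1, 1, 1)],
    ["A", "a1", "a2"])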
I'd like to sort it by the 'A' column first and then by some conditions on 'A1' and 'A2', so I use the following code -
# map each (A1, A2) combination to an explicit rank and sort by it
df_sorted = df.orderBy(col('A'),
                       f.when((col('A1') == 1) & (col('A2') == 0), 1)
                        .when((col('A1') == 0) & (col('A2') == 1), 2)
                        .when((col('A1') == 1) & (col('A2') == 1), 3)
                        .when((col('A1') == 0) & (col('A2') == 0), 4))
which gives me the desired result:
| A| a1| a2|
|:---:|:---:|:---:|
| 1| 1| 0|
| 1| 0| 1|
| 1| 1| 1|
| 2| 0| 1|
| 3| 1| 1|
| 3| 0| 0|
The problem is that I have other groups of columns in the dataframe that I have to sort by in the same way (B, B1, B2, C, C1, C2 and so on), so I prefer the following approach -
# one boolean condition per desired rank, listed in the intended order
sorting_order = [(col("A1") == 1) & (col("A2") == 0),
                 (col("A1") == 0) & (col("A2") == 1),
                 (col("A1") == 1) & (col("A2") == 1),
                 (col("A1") == 0) & (col("A2") == 0)]

df_sorted = df.orderBy(col("A"), *sorting_order)
because I can write a function that returns the necessary lists for all the other sortings, but I get the wrong result:
| A| a1| a2|
|:---:|:---:|:---:|
| 1| 1| 1|
| 1| 0| 1|
| 1| 1| 0|
| 2| 0| 1|
| 3| 0| 0|
| 3| 1| 1|
It looks like the order of the conditions on a1 and a2 is now reversed! I guess I could reverse the sorting_order list to get the right result, but I'd like to know why the result is not what I expected.
I also tried df_sorted = df.orderBy([col("A"), *sorting_order], ascending=[1, 0])
but that messed up the output even more:
| A| a1| a2|
|:---:|:---:|:---:|
| 1| 1| 0|
| 1| 1| 1|
| 1| 0| 1|
So my question is: why is the ordering reversed when I use the list, and is there a way to avoid it without reversing the list?
You are passing a list of boolean conditions as sort keys, which is equivalent to sorting by a list of boolean columns. As in most SQL-based systems, PySpark sorts in ascending order by default, and for booleans False sorts before True.
So for the first condition in your list, (col("A1") == 1) & (col("A2") == 0), rows where it is False come first and rows where it is True are pushed to the end; ties are then broken by the next condition, and so on. Because every row matches exactly one of the four conditions, the net effect is that the rows come out in the reverse of the order in which you listed the conditions. In your first version the when chain collapses the four conditions into a single integer rank (1 to 4), which is why that ordering behaves as you expect.
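You can see the boolean ordering with a tiny throwaway example (a sketch, just to illustrate False-before-True) -

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([(1,), (0,), (1,), (0,)], ["flag"])

# ascending sort on a boolean expression: rows where it is False come out first
demo.orderBy(col("flag") == 1).show()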
To achieve your desired sorting, you can reverse the order of your conditions in the list. Try this code:
# same conditions as before, but listed in reverse of the desired output order
sorting_order = [(col("A1") == 0) & (col("A2") == 0),
                 (col("A1") == 1) & (col("A2") == 1),
                 (col("A1") == 0) & (col("A2") == 1),
                 (col("A1") == 1) & (col("A2") == 0)]

df_sorted = df.orderBy(col("A"), *sorting_order)
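If you'd rather keep the conditions written in the order you actually want the rows to appear, one alternative (a sketch relying on Column.desc()) is to sort each boolean key in descending order, so that rows where the condition is True come first -

# conditions in the desired output order
sorting_order = [(col("A1") == 1) & (col("A2") == 0),
                 (col("A1") == 0) & (col("A2") == 1),
                 (col("A1") == 1) & (col("A2") == 1),
                 (col("A1") == 0) & (col("A2") == 0)]

# descending puts True before False, so matching rows rise to the top of each 'A' group
df_sorted = df.orderBy(col("A"), *[c.desc() for c in sorting_order])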