apache-spark · pyspark · rdd

PySpark: how to sort by a value and, if the values are equal, sort by the key?


For instance:

tmp = [('a', 1), ('e', 1), ('b', 1), ('f', 3), ('d', 4), ('c', 5)]
sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()

# this way, it only sorts by the value:
[('a', 1), ('e', 1), ('b', 1), ('f', 3), ('d', 4), ('c', 5)]

What I want is: if both values are equal, then compare the keys ('a', 'b', 'c', 'd', ...).

The expected output is:

[('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]

I know it's easy to achieve this by using sortBy twice: first sort by the key, then sort by the value. However, I suspect that may not be feasible when the dataset is distributed.

Is there a single lambda key function that can handle this?


Solution

  • You can sort by the second element, then by the first element:

    sc.parallelize(tmp).sortBy(lambda x: [x[1], x[0]]).collect()
    [('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]
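
  • A tuple key behaves the same as the list key, and negating the numeric value lets you mix directions, e.g. value descending while ties still sort by key ascending. A quick sketch, assuming the same sc and tmp as above:

    sc.parallelize(tmp).sortBy(lambda x: (x[1], x[0])).collect()
    # [('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]

    # value descending, key ascending for ties
    sc.parallelize(tmp).sortBy(lambda x: (-x[1], x[0])).collect()
    # [('c', 5), ('d', 4), ('f', 3), ('a', 1), ('b', 1), ('e', 1)]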