For instance:
tmp = [('a', 1), ('e', 1), ('b', 1), ('f', 3), ('d', 4), ('c', 5)]
sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()
# This sorts by the value only.
[('a', 1), ('e', 1), ('b', 1), ('f', 3), ('d', 4), ('c', 5)]
What I want is: if two values are equal, compare the keys ('a', 'b', 'c', 'd', ...) to break the tie.
The expected output is:
[('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]
I know this is easy to achieve by calling sortBy twice: first sort by the key, then sort by the value. However, I think that may not be feasible if the dataset is distributed.
Is there a lambda function that handles this in a single sort?
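You can sort by a composite key: the second element, then the first element. Python compares sequences element by element, so the first element is only consulted when the second elements are equal: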
sc.parallelize(tmp).sortBy(lambda x: [x[1], x[0]]).collect()
[('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]
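If you want to check the key function without a Spark cluster, here is a minimal local sketch using Python's built-in sorted on the same data. The helper name value_then_key is hypothetical (it is not part of Spark), and it returns a tuple rather than a list, which compares the same way.

```python
# Sample data from the question: (key, value) pairs.
tmp = [('a', 1), ('e', 1), ('b', 1), ('f', 3), ('d', 4), ('c', 5)]


def value_then_key(pair):
    """Hypothetical helper: build a composite sort key (value, key).

    Tuples compare element by element, so the key only matters
    when two values are equal.
    """
    key, value = pair
    return (value, key)


# Local check with plain Python; no SparkContext is needed here.
result = sorted(tmp, key=value_then_key)
print(result)
# [('a', 1), ('b', 1), ('e', 1), ('f', 3), ('d', 4), ('c', 5)]
```

Assuming the same SparkContext sc as in the question, the function can be passed to sortBy in place of the lambda: sc.parallelize(tmp).sortBy(value_then_key).collect(). Because this is a single sort on one composite key, it does not rely on a second pass preserving the order of the first.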