
PySpark - sortByKey() method to return values from k,v pairs in their original order


I need to be able to return a list of values from (key,value) pairs from an RDD while maintaining original order.

I've included my workaround below but I'd like to be able to do it all in one go.

Something like:

myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
values = myRDD.<insert PySpark method(s)>
print values
>>>[2582, 3222, 4190, 2502, 2537]

My workaround:

myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]

values = []
for item in myRDD.sortByKey(True).collect():
    values.append(item[1])
print values
>>>[2582, 3222, 4190, 2502, 2537]

Thanks!


Solution

  • If by "original order" you mean the order of the keys, then all you have to do is add a map after the sort:

    myRDD.sortByKey(ascending=True).map(lambda kv: kv[1]).collect()
    

    or call the values method:

    myRDD.sortByKey(ascending=True).values().collect()
    
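    For completeness, here is a minimal end-to-end sketch of the second variant; it assumes a SparkContext named sc is available (e.g. in a pyspark shell or via SparkContext.getOrCreate()):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    myRDD = sc.parallelize([(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)])

    # Sort by key, drop the keys, and collect only the values to the driver.
    values = myRDD.sortByKey(ascending=True).values().collect()
    print(values)
    # [2582, 3222, 4190, 2502, 2537]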

    If you mean the order of the values in the structure that was used to create the initial RDD, then it is impossible without storing additional information. RDDs are unordered unless you explicitly apply transformations like sortBy.
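    If the order of the original input does matter, one way to store that additional information (just a sketch, not part of the answer above) is to attach a position with zipWithIndex when the RDD is built and sort by it afterwards:

    pairs = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
    indexed = sc.parallelize(pairs).zipWithIndex()   # ((key, value), position)
    values = (indexed
              .sortBy(lambda x: x[1])                # restore the original positions
              .map(lambda x: x[0][1])                # keep only the value
              .collect())
    # [2582, 3222, 4190, 2502, 2537]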