Tags: python, apache-spark, pyspark, rdd

How to sort RDD inner list elements in PySpark?


I have created an RDD like below:

rdd = sc.parallelize([['A','C','B'], ['D','A','B','C'], ['C','B'], ['B']])

I want to sort the inner list elements. For example, the first element inside the RDD is ['A','C','B'], but I want it sorted as ['A','B','C'].

My expected output is:

 [['A','B','C'], ['A','B','C','D'], ['B','C'], ['B']]

Solution

  • It is easier and usually more efficient to work with DataFrames than with RDDs, since the Spark optimizer works on DataFrames, whereas you have to optimize RDDs yourself:

    from pyspark.sql.functions import sort_array

    df = spark.createDataFrame([[['A','C','B']], [['D','A','B','C']], [['C','B']], [['B']]], ['l'])
    df.show()
    +------------+
    |           l|
    +------------+
    |   [A, C, B]|
    |[D, A, B, C]|
    |      [C, B]|
    |         [B]|
    +------------+
    
    df.withColumn('l', sort_array('l')).show()
    +------------+
    |           l|
    +------------+
    |   [A, B, C]|
    |[A, B, C, D]|
    |      [B, C]|
    |         [B]|
    +------------+
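
    As a side note, sort_array also takes an asc flag, so the same call can produce descending order:

    df.withColumn('l', sort_array('l', asc=False)).show()
    +------------+
    |           l|
    +------------+
    |   [C, B, A]|
    |[D, C, B, A]|
    |      [C, B]|
    |         [B]|
    +------------+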
    

    If you still want an RDD, you can always convert back:

    rdd = df.withColumn('l', sort_array('l')).rdd
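
    Note that df.rdd is an RDD of Row objects, so each element comes back as Row(l=['A','B','C']) rather than a plain list. A minimal sketch to recover exactly the lists from the expected output, using the column name 'l' from above:

    rdd = df.withColumn('l', sort_array('l')).rdd.map(lambda r: r.l)

    And if you would rather stay with RDDs the whole way, Python's built-in sorted applied per element does the same job; a simple sketch on the original data:

    rdd = sc.parallelize([['A','C','B'], ['D','A','B','C'], ['C','B'], ['B']])
    sorted_rdd = rdd.map(sorted)   # sorted() returns a new sorted list for each inner list
    sorted_rdd.collect()
    # [['A', 'B', 'C'], ['A', 'B', 'C', 'D'], ['B', 'C'], ['B']]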