Tags: python, apache-spark, pyspark, rdd

How to convert Pair RDD Tuple key to String key in Pyspark?


I have created an RDD like below:

rdd=sc.parallelize([('AA', 44),('BB', 53),(('AA', 'Bb'), 23), (('AD', 'AC'), 23),(('AA', 'BB', 'CC'), 2)])

I want to convert the tuple keys to strings.

My expected output is as below; new_rdd.collect() should give:

[('AA', 44), ('BB', 53), ('AA,Bb', 23), ('AD,AC', 23), ('AA,BB,CC', 2)]

Solution

  • map over the RDD and check the key type in each tuple: if the key is already a string, keep it as-is; otherwise join the tuple elements with ',':

    rdd.map(lambda t: (t[0] if isinstance(t[0], str) else ','.join(t[0]), t[1])).collect()
    # [('AA', 44), ('BB', 53), ('AA,Bb', 23), ('AD,AC', 23), ('AA,BB,CC', 2)]
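
For completeness, here is a minimal self-contained sketch that runs end to end. The SparkContext.getOrCreate() setup and the new_rdd variable name are illustrative additions, not part of the original answer:

    # Illustrative standalone script; assumes a local PySpark installation.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize([('AA', 44), ('BB', 53), (('AA', 'Bb'), 23),
                          (('AD', 'AC'), 23), (('AA', 'BB', 'CC'), 2)])

    # Keep string keys unchanged; join tuple keys into a comma-separated string.
    new_rdd = rdd.map(lambda t: (t[0] if isinstance(t[0], str) else ','.join(t[0]), t[1]))

    print(new_rdd.collect())
    # [('AA', 44), ('BB', 53), ('AA,Bb', 23), ('AD,AC', 23), ('AA,BB,CC', 2)]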