Tags: apache-spark, pyspark, rdd

Rearranging RDD in PySpark


I have an RDD that looks like this:

[('a', [('d2', 1), ('d1', 1)]),
 ('addition', [('d2', 1)]),
 ('administrative', [('d1', 1)]),
 ('also', [('d1', 1)])]

I want the output to look like:

a#d2:1;d1:1
addition#d2:1
administrative#d1:1
also#d1:1

I was trying to remove the brackets first in order to achieve that output:

import re
rdd_new.map(lambda x: re.sub(r'\(|\)', '', str(x)))

Solution

  • Rather than string-munging the repr of each tuple, you can map each RDD entry directly to a formatted string using ordinary string methods:

    result = rdd.map(lambda r: r[0] + '#' + ';'.join(['%s:%d' % (i[0], i[1]) for i in r[1]]))
    
    result.collect()
    # ['a#d2:1;d1:1', 'addition#d2:1', 'administrative#d1:1', 'also#d1:1']
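The formatting logic inside the lambda is plain Python, so you can verify it without a Spark session before applying it with `rdd.map`. Here is a minimal sketch (the helper name `format_entry` is mine, not from the original answer):

```python
def format_entry(entry):
    """Turn ('a', [('d2', 1), ('d1', 1)]) into 'a#d2:1;d1:1'."""
    word, postings = entry
    # Join each (doc, count) pair as 'doc:count', separated by ';'
    return word + '#' + ';'.join('%s:%d' % (doc, count) for doc, count in postings)

data = [('a', [('d2', 1), ('d1', 1)]),
        ('addition', [('d2', 1)]),
        ('administrative', [('d1', 1)]),
        ('also', [('d1', 1)])]

print([format_entry(e) for e in data])
# ['a#d2:1;d1:1', 'addition#d2:1', 'administrative#d1:1', 'also#d1:1']
```

On a real RDD you would pass the same function to `map` (`rdd.map(format_entry)`), and `saveAsTextFile` would then write one formatted line per entry.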