Tags: apache-spark, pyspark, rdd

Distribute key to all values in pyspark rdd


I have an RDD of the form:

[(1111, [(0, 174, 12.44, 3.125, u'c29'), (0, 175, 12.48, 6.125, u'c59')]), (2222, [(0, 178, 19.41, 2.165, u'c79'), (0, 171, 18.41, 3.125, u'c41')])]

How can I flatten the intermediate lists and obtain the RDD as a list of tuples, where each tuple contains the corresponding key followed by its values, like so:

[(1111, 0, 174, 12.44, 3.125, u'c29'), (1111, 0, 175, 12.48, 6.125, u'c59'), (2222, 0, 178, 19.41, 2.165, u'c79'), (2222, 0, 171, 18.41, 3.125, u'c41')]

Solution

  • Just flatMap, prepending the key to each value tuple:

    rdd.flatMap(lambda x: [(x[0],) + y for y in x[1]])
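To see why this works without spinning up a cluster, here is a minimal pure-Python sketch of the same logic: `flatMap` applies a function that returns a list to every record and concatenates the results, and the lambda simply prepends the key `x[0]` to each value tuple in `x[1]`. (The `data` list below mirrors the example RDD from the question; the list comprehension stands in for Spark's `flatMap`.)

```python
# Each record is (key, list_of_value_tuples), as in the question's RDD.
data = [
    (1111, [(0, 174, 12.44, 3.125, u'c29'), (0, 175, 12.48, 6.125, u'c59')]),
    (2222, [(0, 178, 19.41, 2.165, u'c79'), (0, 171, 18.41, 3.125, u'c41')]),
]

# The same function passed to flatMap: prepend the key to every value tuple.
flatten = lambda x: [(x[0],) + y for y in x[1]]

# flatMap = map the function over the records, then concatenate the lists.
result = [t for record in data for t in flatten(record)]
print(result)
```

Note the trailing comma in `(x[0],)`: it makes a one-element tuple, so `+` performs tuple concatenation rather than arithmetic. On an actual RDD the call would be `rdd.flatMap(flatten)`, yielding the four flat tuples shown above.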