
How can I group multiple RDDs by key without aggregating the original RDD's records?


I have two RDDs that share common keys, in a format like:

 x = sc.parallelize([("A", 1), ("B", 4), ("A", 2)])
 y = sc.parallelize([("A", -1), ("B", 5)])

Then I want to group them by the common keys, "A" and "B".

I have tried the command below:

 z = [(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]
 print(z)

What I got is

[('A', ([1, 2], [-1])), ('B', ([4], [5]))]

However, what I want is

[('A', ([1], [-1])), ('B', ([4], [5])), ('A', ([2], [-1]))]

How can I change the code to get the output shown above? Thank you.


Solution

  • You can do this with a straight join:

    print(x.join(y).collect())
    #[('A', (1, -1)), ('A', (2, -1)), ('B', (4, 5))]
    

    Add in a call to mapValues if you want the elements of the tuples to be lists:

    print(x.join(y).mapValues(lambda a: tuple([b] for b in a)).collect())
    #[('A', ([1], [-1])), ('A', ([2], [-1])), ('B', ([4], [5]))]
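
    For reference, here is a minimal, self-contained sketch of the whole pipeline (assuming a local SparkContext; the app name and variable names are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local", "join-example")  # hypothetical local context for illustration

    x = sc.parallelize([("A", 1), ("B", 4), ("A", 2)])
    y = sc.parallelize([("A", -1), ("B", 5)])

    # join emits one record per (left value, right value) pairing for each key,
    # so the two 'A' records from x stay separate
    joined = x.join(y)
    print(joined.collect())
    # e.g. [('A', (1, -1)), ('A', (2, -1)), ('B', (4, 5))] -- order may vary

    # wrap each element of the value tuple in a single-item list
    wrapped = joined.mapValues(lambda a: tuple([b] for b in a))
    print(wrapped.collect())
    # e.g. [('A', ([1], [-1])), ('A', ([2], [-1])), ('B', ([4], [5]))]

    sc.stop()

    This also shows why cogroup could not produce the desired output: cogroup gathers all values for a key on each side into a single record (hence ([1, 2], [-1])), while join produces one record per cross-pair.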