Select elements (x, y) from an RDD where (y, x) is also present in the RDD


I have the following RDD:

[('K', ' M'),
 ('K', ' H'),
 ('M', ' K'),
 ('M', ' E'),
 ('H', ' F'),
 ('B', ' T'),
 ('B', ' H'),
 ('E', ' K'),
 ('E', ' H'),
 ('F', ' K'),
 ('F', ' H'),
 ('F', ' E'),
 ('A', ' Z')]

I want to keep only the elements (x, y) for which (y, x) is also present in the RDD. In my example the output should be:

[(K,M),
 (H,F)]

Thanks for your help.


Solution

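  • You can put each tuple into a canonical order, count how often each ordered tuple occurs, and then keep the tuples that appear more than once. The sample values carry a leading space (' M'), so strip them first; otherwise ('K', ' M') and ('M', ' K') never share a key:

    # Strip the stray leading spaces so that ' M' and 'M' compare equal.
    (rdd.map(lambda t: tuple(s.strip() for s in t))
        # Key each tuple by its sorted form, so (x, y) and (y, x) share a key.
        .groupBy(lambda t: (min(t), max(t)))
        .mapValues(len)               # number of tuples per unordered pair
        .filter(lambda t: t[1] > 1)   # keep pairs that occur more than once
        .map(lambda t: t[0])
        .collect())

    # [('F', 'H'), ('K', 'M')]  (order may vary)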
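The logic can be checked without a Spark session by sketching the same steps in plain Python on the sample data. The helper names below (`normalize`, `unordered_key`, `symmetric_pairs`) are illustrative and not part of PySpark. The sketch also deduplicates the tuples before counting, which is an added assumption: without it, a tuple that is simply repeated in the same orientation would be reported as if its reverse were present.

```python
from collections import Counter

pairs = [('K', ' M'), ('K', ' H'), ('M', ' K'), ('M', ' E'), ('H', ' F'),
         ('B', ' T'), ('B', ' H'), ('E', ' K'), ('E', ' H'), ('F', ' K'),
         ('F', ' H'), ('F', ' E'), ('A', ' Z')]


def normalize(t):
    """Strip stray whitespace so that ' M' and 'M' compare equal."""
    return tuple(s.strip() for s in t)


def unordered_key(t):
    """Map (x, y) and (y, x) to the same key."""
    return (min(t), max(t))


def symmetric_pairs(items):
    # Deduplicate first so a repeated (x, y) is not mistaken for a (y, x) match.
    distinct = {normalize(t) for t in items}
    # Count how many distinct orientations exist for each unordered pair.
    counts = Counter(unordered_key(t) for t in distinct)
    # A count of 2 means both (x, y) and (y, x) were present.
    return sorted(key for key, n in counts.items() if n > 1)


result = symmetric_pairs(pairs)
print(result)  # [('F', 'H'), ('K', 'M')]
```

In the RDD version, the equivalent of the deduplication step would be a call to `.distinct()` after the stripping `map` and before `groupBy`.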