I have the following RDD:
[('K', ' M'),
('K', ' H'),
('M', ' K'),
('M', ' E'),
('H', ' F'),
('B', ' T'),
('B', ' H'),
('E', ' K'),
('E', ' H'),
('F', ' K'),
('F', ' H'),
('F', ' E'),
('A', ' Z')]
I want to keep only the elements (x, y) for which (y, x) is also present in the RDD. In my example, the output should look like this:
[(K,M),
(H,F)]
Thanks for your help.
You can put each tuple in sorted order, count how many tuples share each sorted form, and then keep the ones that appear more than once:
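# Assumption: the leading spaces in the sample (' M') are unintended, so strip them;
# otherwise ('K', ' M') and ('M', ' K') would never share a key.
pairs = rdd.map(lambda t: (t[0].strip(), t[1].strip()))
(pairs.groupBy(lambda t: (min(t), max(t)))  # key each pair by its sorted form
    .mapValues(len)                          # count the tuples sharing that key
    .filter(lambda t: t[1] > 1)              # keep keys seen more than once
    .map(lambda t: t[0])
    .collect())
# [('F', 'H'), ('K', 'M')]  (order may vary)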
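To sanity-check the expected result on a small sample without a Spark session, the same idea can be sketched in plain Python. This is only an illustration, not part of the Spark API: `mutual_pairs` and `sample` are hypothetical names, and the sketch assumes the leading spaces in the data are unintended. It uses a set rather than a count, so a tuple that simply appears twice in the input is not mistaken for a reversed pair.

```python
def mutual_pairs(pairs):
    """Return sorted pairs (x, y), x < y, for which both (x, y) and (y, x) occur."""
    # Strip stray whitespace such as ' M' so that ('K', ' M') and ('M', ' K') can match.
    cleaned = {(a.strip(), b.strip()) for a, b in pairs}
    # Keep one representative per mutual pair; x < y also skips self-pairs like ('A', 'A').
    return sorted((x, y) for x, y in cleaned if x < y and (y, x) in cleaned)


sample = [('K', ' M'), ('K', ' H'), ('M', ' K'), ('M', ' E'), ('H', ' F'),
          ('B', ' T'), ('B', ' H'), ('E', ' K'), ('E', ' H'), ('F', ' K'),
          ('F', ' H'), ('F', ' E'), ('A', ' Z')]
print(mutual_pairs(sample))  # [('F', 'H'), ('K', 'M')]
```

Each pair is reported once in sorted form, which matches the output of the groupBy version above apart from ordering.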