
Reduce key, value pair based on similarity of their value in PySpark


I am a beginner in PySpark.

I want to find the pairs of letters that share the same number as their value, and then find out which pair of letters appears most often.

Here is my data

data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]

The result I want would look like this:

1: a, f
4: b, d
4: b, e
4: d, e
10: c
5: b, d

I have tried the following:

data1= data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]

data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]

As I said, I am very new to PySpark. I searched for a command to do this but was not successful. Could anyone please help me with this?


Solution

  • You can use flatMap with Python's itertools.combinations to generate the 2-combinations of each group's values. Also, prefer reduceByKey over groupByKey, since it combines values on each partition before shuffling the data:

    from itertools import combinations
    
    # swap each pair to (number, [letter]) so the lists can be concatenated per key
    result = data.map(lambda x: (x[1], [x[0]])) \
        .reduceByKey(lambda a, b: a + b) \
        .flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if len(x[1]) > 1 else 1)])
    
    result.collect()
    
    #[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
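As a side note, the behavior of itertools.combinations explains both the pair ordering in the output and the special-casing of single-letter groups. This pure-Python snippet (no Spark needed) shows what the flatMap step does to one group:

```python
from itertools import combinations

# combinations yields pairs in the order elements appear in the input,
# without repeating an element within a pair.
print(list(combinations(['b', 'd', 'e'], 2)))
# [('b', 'd'), ('b', 'e'), ('d', 'e')]

# With a single element, combinations(..., 1) wraps it in a 1-tuple,
# which is why the solution switches the combination size to 1 for lone values.
print(list(combinations(['c'], 1)))
# [('c',)]
```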
    

    If you would rather get None as the second element when a group has only one letter, pad the list before taking combinations:

    .flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
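
    The question also asks which pair of letters appears most often. One way is a further `result.map(lambda x: (x[1], 1)).reduceByKey(lambda a, b: a + b)` on the pairs. Here is a local pure-Python sanity check of that counting logic (dict grouping stands in for reduceByKey, collections.Counter for the final count; no Spark required):

```python
from itertools import combinations
from collections import Counter

# Same sample data as in the question.
data = [('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4),
        ('f', 1), ('b', 5), ('d', 5)]

# Group letters by number (mirrors the reduceByKey step).
groups = {}
for letter, number in data:
    groups.setdefault(number, []).append(letter)

# Emit all 2-combinations per group (mirrors the flatMap step),
# then count how often each letter pair occurs across groups.
pair_counts = Counter(
    pair
    for letters in groups.values()
    if len(letters) > 1
    for pair in combinations(letters, 2)
)

print(pair_counts.most_common(1))
# [(('b', 'd'), 2)] -- ('b', 'd') occurs for both key 4 and key 5
```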