I am a beginner in PySpark.
I want to group the letters that share the same number as their value, and then find out which pairs of letters appear together most often.
Here is my data
data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]
The result I want would look like this:
1: a,f
4: b, d
4: b, e
4: d, e
10: c
5: b, d
I have tried the following:
data1= data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]
data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]
As I said, I am very new to PySpark. I searched for the right command for this but was not successful. Could anyone please help me with this?
You can use flatMap with Python's itertools.combinations to get combinations of 2 from the grouped values. Also, prefer using reduceByKey rather than groupByKey:
from itertools import combinations
# swap to (number, [letter]) so the letter lists can be concatenated per key
result = data.map(lambda x: (x[1], [x[0]])) \
    .reduceByKey(lambda a, b: a + b) \
    .flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if len(x[1]) > 1 else 1)])
result.collect()
#[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
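If you want to sanity-check the grouping-plus-combinations logic without a Spark cluster, the same steps can be sketched in plain Python (the `data` list below mirrors the RDD contents; `defaultdict` stands in for `reduceByKey`):

```python
from itertools import combinations
from collections import defaultdict

# Mirror of the RDD contents as a plain Python list
data = [('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4),
        ('f', 1), ('b', 5), ('d', 5)]

# Group letters by their number, like the reduceByKey step
groups = defaultdict(list)
for letter, number in data:
    groups[number].append(letter)

# Emit pairs (or a 1-tuple when a group has only one letter), like the flatMap step
result = [(num, p)
          for num, letters in groups.items()
          for p in combinations(letters, 2 if len(letters) > 1 else 1)]
# result contains e.g. (1, ('a', 'f')), (4, ('b', 'd')), (4, ('b', 'e')),
# (4, ('d', 'e')), (10, ('c',)), (5, ('b', 'd'))
```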
If you want to get None when a group has only one element, you can pad the list with None and always take combinations of 2:
.flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
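In plain Python terms, padding a single-element group with None before taking combinations of 2 behaves like this (a local sketch of that flatMap expression, using a one-letter group as an example):

```python
from itertools import combinations

# A group with a single letter, as produced by the reduceByKey step
letters = ['c']

# Pad with None so combinations(..., 2) still yields one pair
padded = letters if len(letters) > 1 else letters + [None]
pairs = list(combinations(padded, 2))
# pairs == [('c', None)]
```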