
Reduce key, value pair based on similarity of their value in PySpark


I am a beginner in PySpark.

I want to find the pairs of letters that share the same number as their value, and then find out which pair of letters appears most often.

Here is my data

data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]

The result I want would look like this:

1: a, f
4: b, d
4: b, e
4: d, e
10: c
5: b, d

I have tried the following:

data1= data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]

data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]

As I said, I am very new to PySpark. I searched for a command to do this but was not successful. Could anyone please help me with this?


Solution

  • You can use flatMap with Python's itertools.combinations to generate the 2-combinations of each group's values. Also, prefer reduceByKey over groupByKey, since it combines values on each partition before shuffling the data:

    from itertools import combinations
    
    # swap each pair to (number, [letter]) so the lists can be concatenated per key
    result = data.map(lambda x: (x[1], [x[0]])) \
        .reduceByKey(lambda a, b: a + b) \
        .flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if len(x[1]) > 1 else 1)])
    
    result.collect()
    
    #[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
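As a side note, the behavior of itertools.combinations explains both the pair ordering in the output and the special-casing of single-letter groups. This pure-Python snippet (no Spark needed) shows what the flatMap step does to one group:

```python
from itertools import combinations

# combinations yields pairs in the order elements appear in the input,
# without repeating an element within a pair.
print(list(combinations(['b', 'd', 'e'], 2)))
# [('b', 'd'), ('b', 'e'), ('d', 'e')]

# With a single element, combinations(..., 1) wraps it in a 1-tuple,
# which is why the solution switches the combination size to 1 for lone values.
print(list(combinations(['c'], 1)))
# [('c',)]
```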
    

    If you would rather get None as the second element when a group has only one letter, pad the list before taking combinations:

    .flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
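
    The question also asks which pair of letters appears most often. One way is a further `result.map(lambda x: (x[1], 1)).reduceByKey(lambda a, b: a + b)` on the pairs. Here is a local pure-Python sanity check of that counting logic (dict grouping stands in for reduceByKey, collections.Counter for the final count; no Spark required):

```python
from itertools import combinations
from collections import Counter

# Same sample data as in the question.
data = [('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4),
        ('f', 1), ('b', 5), ('d', 5)]

# Group letters by number (mirrors the reduceByKey step).
groups = {}
for letter, number in data:
    groups.setdefault(number, []).append(letter)

# Emit all 2-combinations per group (mirrors the flatMap step),
# then count how often each letter pair occurs across groups.
pair_counts = Counter(
    pair
    for letters in groups.values()
    if len(letters) > 1
    for pair in combinations(letters, 2)
)

print(pair_counts.most_common(1))
# [(('b', 'd'), 2)] -- ('b', 'd') occurs for both key 4 and key 5
```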