
How to group and count values in an RDD to return a small summary using pyspark?


Some example data:

new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 87},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 90}]

new_rdd = sc.parallelize(new_data)
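
(The sc here is the SparkContext that the pyspark shell provides. If you're running this outside the shell, a minimal sketch for getting one could look like the following; the app name is arbitrary.)

from pyspark.sql import SparkSession

# Minimal sketch: build a SparkSession and take its SparkContext
spark = SparkSession.builder.appName("exam-scores").getOrCreate()
sc = spark.sparkContext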

A student passes an exam if they score 60 or more.

I would like to return a Spark RDD with each student's name followed by the number of exams they pass (which should be a number between 1 and 3).

I'm assuming I would have to use groupByKey() and map() here?

The expected output should look something like:

# [('Tom', 2),
# ('Ivy', 2),
# ('Ben', 1),
# ('Eve', 3)]

Solution

  • You can use filter() for the pass condition, then map() to emit each name as a key with a count of 1, and finally reduceByKey() to sum the counts per student.

    new_rdd. \
        filter(lambda r: r['exam_score'] >= 60). \
        map(lambda r: (r['name'], 1)). \
        reduceByKey(lambda x, y: x + y). \
        collect()
    
    # [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]
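
  • Since the question mentions groupByKey(), here is a sketch of that alternative; it gives the same result, though reduceByKey() is generally preferred because it combines counts on each partition before shuffling. Note that the order of the collected pairs is not guaranteed in either case.

    new_rdd. \
        filter(lambda r: r['exam_score'] >= 60). \
        map(lambda r: (r['name'], 1)). \
        groupByKey(). \
        mapValues(lambda counts: sum(counts)). \
        collect()

    # [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]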