
How to group and count values in an RDD to return a small summary using pyspark?


Some example data:

new_data = [{'name': 'Tom', 'subject': "maths", 'exam_score': 85},
            {'name': 'Tom', 'subject': "science", 'exam_score': 55},
            {'name': 'Tom', 'subject': "history", 'exam_score': 68},
            {'name': 'Ivy', 'subject': "maths", 'exam_score': 72},
            {'name': 'Ivy', 'subject': "science", 'exam_score': 67},
            {'name': 'Ivy', 'subject': "history", 'exam_score': 59},
            {'name': 'Ben', 'subject': "maths", 'exam_score': 56},
            {'name': 'Ben', 'subject': "science", 'exam_score': 51},
            {'name': 'Ben', 'subject': "history", 'exam_score': 63},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 74},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 87},
            {'name': 'Eve', 'subject': "maths", 'exam_score': 90}]

new_rdd = sc.parallelize(new_data)
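
(The sc here is the SparkContext that the pyspark shell provides. If you're running this outside the shell, a minimal sketch for getting one could look like the following; the app name is arbitrary.)

from pyspark.sql import SparkSession

# Minimal sketch: build a SparkSession and take its SparkContext
spark = SparkSession.builder.appName("exam-scores").getOrCreate()
sc = spark.sparkContext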

A student passes an exam if they score 60 or more.

I would like to return a Spark RDD with each student's name followed by the number of exams they pass (which should be a number between 1 and 3).

I'm assuming I would have to use groupByKey() and map() here?

The expected output should look something like:

# [('Tom', 2),
# ('Ivy', 2),
# ('Ben', 1),
# ('Eve', 3)]

Solution

  • You can use filter() for the pass condition, then map() to emit each name as a key with a count of 1, and finally reduceByKey() to sum the counts per student.

    new_rdd. \
        filter(lambda r: r['exam_score'] >= 60). \
        map(lambda r: (r['name'], 1)). \
        reduceByKey(lambda x, y: x + y). \
        collect()
    
    # [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]
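
  • Since the question mentions groupByKey(), here is a sketch of that alternative; it gives the same result, though reduceByKey() is generally preferred because it combines counts on each partition before shuffling. Note that the order of the collected pairs is not guaranteed in either case.

    new_rdd. \
        filter(lambda r: r['exam_score'] >= 60). \
        map(lambda r: (r['name'], 1)). \
        groupByKey(). \
        mapValues(lambda counts: sum(counts)). \
        collect()

    # [('Tom', 2), ('Ivy', 2), ('Ben', 1), ('Eve', 3)]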