
PySpark: getting only the minimum values


I want to get only the entries with the minimum value.

import pyspark as ps

spark = ps.sql.SparkSession.builder.master('local[4]')\
    .appName('some-name-here').getOrCreate()

sc = spark.sparkContext

sc.textFile('path-to.csv')\
    .map(lambda x: x.replace('"', '').split(','))\
    .filter(lambda x: not x[0].startswith('player_id'))\
    .map(lambda x: (x[2] + " " + x[1], int(x[8]) if x[8] else 0))\
    .reduceByKey(lambda value1, value2: value1 + value2)\
    .sortBy(lambda price: price[1], ascending=True).collect()

This is what I get:

[('Cedric Ceballos', 0), ('Maurcie Cheeks', 0), ('James Foster', 0), ('Billy Gabor', 0), ('Julius Keye', 0), ('Anthony Mason', 0), ('Chuck Noble', 0), ('Theo Ratliff', 0), ('Austin Carr', 0), ('Mark Eaton', 0), ('A.C. Green', 0), ('Darrall Imhoff', 0), ('John Johnson', 0), ('Neil Johnson', 0), ('Jim King', 0), ('Max Zaslofsky', 1), ('Don Barksdale', 1), ('Curtis Rowe', 1), ('Caron Butler', 2), ('Chris Gatling', 2)].

As you can see, there are a lot of keys with the value 0, which is the minimum. How can I get only those entries?


Solution

  • You can collect the minimum value into a variable, and do an equality filter based on that variable:

    # same pipeline as in the question, minus the final collect();
    # sortBy ascending puts the minimum value first
    rdd = sc.textFile('path-to.csv')\
        .map(lambda x: x.replace('"', '').split(','))\
        .filter(lambda x: not x[0].startswith('player_id'))\
        .map(lambda x: (x[2] + " " + x[1], int(x[8]) if x[8] else 0))\
        .reduceByKey(lambda value1, value2: value1 + value2)\
        .sortBy(lambda price: price[1], ascending=True)
    
    # the first element of the ascending-sorted RDD carries the minimum value
    minval = rdd.take(1)[0][1]
    rdd2 = rdd.filter(lambda x: x[1] == minval)
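
  • Alternatively, if you don't need the sorted order, you can drop the sortBy line from the pipeline and compute the minimum directly with RDD's min(). A minimal sketch, reusing the rdd variable from above:

    # a sketch: assumes rdd is the (player, total) pipeline from above,
    # in which case the sortBy step is no longer needed
    minval = rdd.map(lambda x: x[1]).min()       # smallest summed value
    rdd2 = rdd.filter(lambda x: x[1] == minval)  # keep only entries at the minimum

    Either way, rdd2.collect() returns just the zero-valued entries from the output shown in the question.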