python apache-spark rdd

How to get all the Pokémon with the maximum defense using Spark RDD operations?


I am trying to find all the Pokémon with the highest defense value using Spark RDD operations, but I only get one of the 3 Pokémon that share the highest defense value. Is there a way to get all 3 of them using only RDD operations? The Pokémon dataset can be downloaded from Pokemon data. [PS: I need a way to get them without knowing beforehand that there are 3 of them.]

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("trial").setMaster("local")
sc = SparkContext(conf=conf)
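input_path = "Pokemon.csv"  # renamed from `input` to avoid shadowing the builtin
lineRDD = sc.textFile(input_path)
# Keep (name, defense) from columns 1 and 7; any row whose first column is not a number (such as the header) becomes ('', '0')
poke_def = lineRDD.map(lambda line: tuple(line.split(',')[i] for i in (1, 7)) if line.split(',')[0].isdigit() else ('', '0'))
# Returns only one (name, defense) pair
poke_def.reduce(lambda x, y: x if int(x[1]) >= int(y[1]) else y)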

I have also tried using the max function directly instead of reduce, but that also returns only a single Pokémon.

printList(poke_def.max(lambda x: int(x[1])))

Solution

  • I don't think I really understood your question in my other answer. I am not deleting it because it may still be useful.

    If you want to get all the Pokémon with the highest defense without knowing how many there are, you can first compute the maximum defense and then keep every record that matches it:

    >>> poke_def_int = poke_def.mapValues(int)
    >>> max_defense = poke_def_int.values().max()
    >>> best_defense_pokemonRDD = poke_def_int.filter(lambda x: x[1] == max_defense)
    >>> best_defense_pokemonRDD.collect()
    [('SteelixMega Steelix', 230), ('Shuckle', 230), ('AggronMega Aggron', 230)]
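
If you would rather make a single pass over the data, one possible alternative is to reduce over (defense, [names]) pairs and merge the name lists whenever two candidates tie. The following is only a rough sketch under some assumptions: the sample rows are made up to stand in for the parsed (name, defense) pairs, `to_candidate` and `merge_max` are hypothetical helper names, and `functools.reduce` is used as a local stand-in so the snippet runs without Spark.

```python
from functools import reduce

# Hypothetical sample rows standing in for the (name, defense) pairs in poke_def.
sample = [
    ("Bulbasaur", "49"),
    ("SteelixMega Steelix", "230"),
    ("Shuckle", "230"),
    ("Pikachu", "40"),
    ("AggronMega Aggron", "230"),
]


def to_candidate(pair):
    """Turn (name, defense_str) into (defense_int, [name])."""
    name, defense = pair
    return (int(defense), [name])


def merge_max(a, b):
    """Keep the candidate with the larger defense; on a tie, keep the names from both."""
    if a[0] > b[0]:
        return a
    if a[0] < b[0]:
        return b
    return (a[0], a[1] + b[1])


# Plain-Python stand-in for: poke_def.map(to_candidate).reduce(merge_max)
max_defense, names = reduce(merge_max, map(to_candidate, sample))
print(max_defense, names)
```

With an actual RDD, the same two functions should be usable as `poke_def.map(to_candidate).reduce(merge_max)`. Because Spark combines partitions in no particular order, the order of the names in the result may differ from the order in the file.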