apache-spark, rdd

Spark RDD: find ratio for key-value pairs


My RDD contains key-value pairs such as these:

(key1, 5),
(key2, 10),
(key3, 20),

I want to perform a map operation that associates each key with its respective ratio of the total across the entire RDD, like this:

(key1, 5/35),
(key2, 10/35),
(key3, 20/35),

I am struggling to find a way to do this using the standard functions; any help would be appreciated.


Solution

  • You can calculate the sum of the values, then divide each value by that total:

    from operator import add
    
    rdd = sc.parallelize([('key1', 5), ('key2', 10), ('key3', 20)])
    total = rdd.values().reduce(add)           # total of all values: 35
    rdd2 = rdd.mapValues(lambda x: x / total)  # divide each value by the total
    
    rdd2.collect()
    # [('key1', 0.14285714285714285), ('key2', 0.2857142857142857), ('key3', 0.5714285714285714)]
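
    Note that reduce and collect are two separate actions, so the RDD is computed twice. If the source data is expensive to produce, a minimal variation (a sketch, assuming the same sc as above) caches the RDD first:

    from operator import add
    
    # cache the RDD so the two actions below don't recompute the source
    rdd = sc.parallelize([('key1', 5), ('key2', 10), ('key3', 20)]).cache()
    total = rdd.values().reduce(add)           # first action populates the cache
    rdd2 = rdd.mapValues(lambda x: x / total)
    rdd2.collect()                             # second action reuses the cached data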
    

    In Scala it would be:

    val rdd = sc.parallelize(List(("key1", 5), ("key2", 10), ("key3", 20)))
    val total = rdd.values.reduce(_+_)     // total of all values: 35
    val rdd2 = rdd.mapValues(1.0*_/total)  // multiplying by 1.0 avoids integer division on the Int values
    
    rdd2.collect
    // Array[(String, Double)] = Array((key1,0.14285714285714285), (key2,0.2857142857142857), (key3,0.5714285714285714))
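
  • If the data lives in a DataFrame instead, the same ratios can be computed there; a minimal sketch, assuming a SparkSession named spark:

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame([('key1', 5), ('key2', 10), ('key3', 20)], ['key', 'value'])
    total = df.agg(F.sum('value')).first()[0]   # total of the value column: 35
    df.withColumn('ratio', F.col('value') / total).show()

    Here total is pulled back to the driver as a plain number, so the division is an ordinary column expression.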