apache-spark pyspark rdd

How to make calculations between RDD rows?


I have a Spark RDD like this:

[(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]

And I want to calculate the percentage change between consecutive rows. For example, from row 1 to row 2 the value rises to 110.7% of the previous one ((3.1/2.8)*100), and so on.

Any suggestions on how to make calculations between rows?


Solution

  • You can join the RDD with a copy of the same RDD whose keys are shifted by 1:

    rdd = sc.parallelize([(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)])
    
    rdd2 = rdd.map(lambda x: (x[0], x[2]))      # (id, value)
    rdd3 = rdd.map(lambda x: (x[0]+1, x[2]))    # (id+1, value): keys shifted by 1
    
    # The join pairs each value with the previous row's value;
    # mapValues turns the ratio into a percentage.
    rdd4 = rdd2.join(rdd3).mapValues(lambda r: r[0]/r[1]*100)
    
    rdd4.collect()
    # [(2, 110.71428571428572), (3, 103.2258064516129)]
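The shifted-key join can be verified without a Spark cluster; this is a minimal pure-Python sketch of the same logic (plain dicts stand in for the RDDs):

```python
# Pure-Python sketch of the shifted-key join above (no Spark needed).
rows = [(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]

current = {k: v for k, _, v in rows}        # plays the role of rdd2: (id, value)
previous = {k + 1: v for k, _, v in rows}   # plays the role of rdd3: keys shifted by 1

# Join on common keys and compute the percentage, as rdd2.join(rdd3) does.
result = sorted(
    (k, current[k] / previous[k] * 100)
    for k in current.keys() & previous.keys()
)
print(result)
# [(2, 110.71428571428572), (3, 103.2258064516129)]
```

Note that `collect()` on the joined RDD does not guarantee this ordering; sort by key if order matters.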