I have a Spark RDD like this:
[(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]
And I want to calculate the percentage change between sequential rows. For example, from row 1 to row 2 the value grows to 110.7% of the previous one ((3.1/2.8)*100), and so on.
Any suggestions on how to make calculations between rows?
You can join the RDD with a copy of itself whose keys are shifted by 1, so each value lines up with the following row's key:
rdd = sc.parallelize([(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)])
# (id, value) pairs
rdd2 = rdd.map(lambda x: (x[0], x[2]))
# same values keyed by id + 1, so each value pairs with the next row's id
rdd3 = rdd.map(lambda x: (x[0]+1, x[2]))
# join on id and compute current / previous * 100
rdd4 = rdd2.join(rdd3).mapValues(lambda r: r[0]/r[1]*100)
rdd4.collect()
# [(2, 110.71428571428572), (3, 103.2258064516129)]
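The shift-and-join idea can be checked without a Spark cluster; here is a minimal pure-Python sketch of the same logic (dicts standing in for the keyed RDDs, a key intersection standing in for the join):

```python
data = [(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]

# Values keyed by row id, and the same values keyed by id + 1
# (i.e. "shifted" so each value lines up with the following row).
current = {i: v for i, _, v in data}
shifted = {i + 1: v for i, _, v in data}

# "Join" on the common keys and compute current / previous * 100.
result = sorted(
    (k, current[k] / shifted[k] * 100)
    for k in current.keys() & shifted.keys()
)
print(result)
# [(2, 110.71428571428572), (3, 103.2258064516129)]
```

Note that, like the join above, this naturally drops the first row (id 1), which has no predecessor to compare against.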