python apache-spark pyspark databricks databricks-community-edition

How to calculate sums and averages over multiple elements with an RDD

I am new to PySpark. I have built an RDD with the following code:

rdd1 = labRDD.map(lambda kv: (kv[0].split("/")[-1].split('.')[0], kv[1]))
rdd2 = rdd1.flatMapValues(lambda v: v.split('\r\n'))
rdd3 = rdd2.map(lambda kv: (kv[0], kv[0].split('_')[0], kv[1].split()[0], int(kv[1].split()[1])))

The result is ('shop', 'town', 'month', 'revenue'):

[('anger', 'anger', 'JAN', 13),
 ('marseille_1', 'marseille', 'FEB', 12),
 ('marseille_2', 'marseille', 'MAR', 14),
 ('paris_1', 'paris', 'APR', 15),...]

I am not allowed to use DataFrames, so I need RDD results. I have to calculate:

Average monthly income of the shop (all branches/stores) in France
Average monthly income of the shop (all branches) in each city
Total revenue per city per year
Total revenue per store per year
The store that achieves the best performance in each month

Thanks in advance :)

Solution

I've found the answer to two of them, the yearly totals :)

Total revenue per city per year

# Key each record by town (t[1]) and sum the revenue (t[3]), which is already an int.
annual_city_rev = rdd3.map(lambda t: (t[1], t[3])).reduceByKey(lambda x, y: x + y)
annual_city_rev.collect()
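
The same keying by town can probably be extended to the averages. This is only a sketch under assumptions: it treats every record as one store-month and takes the plain mean over those records, which may not be the exact definition of "average monthly income" that the exercise expects. The helper names (to_sum_count, merge_sum_count, mean_of, average_by_key) are made up for illustration and are not part of PySpark; only mapValues, reduceByKey and mean are real RDD methods.

```python
def to_sum_count(revenue):
    """Turn one revenue value into a (sum, count) pair."""
    return (revenue, 1)


def merge_sum_count(a, b):
    """Add two (sum, count) pairs element-wise."""
    return (a[0] + b[0], a[1] + b[1])


def mean_of(pair):
    """Divide the accumulated sum by the accumulated count."""
    total, count = pair
    return total / count


def average_by_key(pair_rdd):
    """Hypothetical helper: takes an RDD of (key, revenue) pairs
    and returns an RDD of (key, mean revenue)."""
    return (pair_rdd
            .mapValues(to_sum_count)
            .reduceByKey(merge_sum_count)
            .mapValues(mean_of))


# Usage with rdd3 from the question (t[1] is the town, t[3] the revenue):
#   city_avg = average_by_key(rdd3.map(lambda t: (t[1], t[3])))
#   city_avg.collect()
# For the whole of France no key is needed:
#   france_avg = rdd3.map(lambda t: t[3]).mean()
```

Carrying a (sum, count) pair through reduceByKey avoids grouping all values of a key in memory, which is why it is used here instead of groupByKey.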

Total revenue per store per year

# Key each record by store (t[0]) and sum the revenue (t[3]), which is already an int.
annual_store_revenue = rdd3.map(lambda t: (t[0], t[3])).reduceByKey(lambda x, y: x + y)
annual_store_revenue.collect()
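
For the store with the best performance in each month, one possible approach is to key by month and keep the highest-revenue store during the reduce. Again a sketch, assuming each store has at most one record per month and that "best performance" means highest revenue; better_store and best_store_per_month are hypothetical names, and ties are resolved in favour of whichever pair the reduce sees first.

```python
def better_store(a, b):
    """Keep whichever (store, revenue) pair has the higher revenue."""
    return a if a[1] >= b[1] else b


def best_store_per_month(records):
    """Hypothetical helper: takes an RDD of (store, town, month, revenue)
    tuples, as built by rdd3, and returns (month, (store, revenue))."""
    return (records
            .map(lambda t: (t[2], (t[0], t[3])))  # key by month
            .reduceByKey(better_store))


# Usage with rdd3 from the question:
#   best_store_per_month(rdd3).collect()
```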