Search code examples
pythonapache-sparkpysparkdatabricksdatabricks-community-edition

how to calculate multiple elements sum and average with RDD


I am a big newbie in pyspark. have organized a RDD with the following code:

rdd1 = labRDD.map(lambda kv: (kv[0].split("/")[-1].split('.')[0], kv[1]))
rdd2 = rdd1.flatMapValues(lambda v: v.split('\r\n'))
rdd3 = rdd2.map(lambda kv: (kv[0], kv[0].split('_')[0], kv[1].split()[0], int(kv[1].split()[1])))

The result is ('town','shop','month','revenue') :

[('anger', 'anger', 'JAN', 13),
 ('marseille', 'marseille_1', 'FEB', 12),
 ('marseille', 'marseille_2', 'MAR', 14),
 ('paris', 'paris_1', 'APR', 15),...]

I am forced not to use dataframe, thus I need RDD results. I have to calculate :

  • Average monthly income of the shop (all branches/stores) in France
  • Average monthly income of the shop (all branches) in each city
  • Total revenue per city per year
  • Total revenue per store per year
  • The store that achieves the best performance in each month

Thanks in advance :)


Solution

  • I've found the answer to the two first ones :)

    Total revenue per city per year

    annual_city_rev = rdd3.map(lambda t:(t[1], t[3])).reduceByKey(lambda x,y:int(x)+int(y))
    annual_city_rev.collect()
    

    Total revenue per store per year

    annual_store_revenue = rdd3.map(lambda t:(t[0], t[3])).reduceByKey(lambda x,y: int(x)+int(y))
    annual_store_revenue.collect()