I have an RDD of key-value pairs, where each key is a centroid and each value is a list of the points nearest to that centroid.
data = [('d1',
[(4.832, 1.963),
(5.439, 2.147),
(5.009, 2.522)]),
('d2',
[(4.26, 2.033),
(5.24, 1.642),
(4.814, 2.033)]),
('d3',
[(4.646, 1.827),
(5.137, 1.858),
(5.288, 1.842)])]
I am trying to calculate the average of the x and y coordinates separately for each centroid (key). I would like to generate the output below:
[('d1',(5.09, 2.21)),
('d2',(4.77, 1.9)),
('d3',(5.02, 1.84))]
I have tried the following code, but I am not getting any result.
data.reduceByKey(lambda x,y: mean(x[1],y[1])).collect()
I am kinda stuck here and would really appreciate some help on this.
You don't need reduceByKey here: each key appears only once, so the data is already grouped by key and the reduce function is never called. You just need to calculate the mean of each entry's list of points, for example with numpy.mean.
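import numpy as np
# axis=0 averages the x and y columns separately;
# tolist() converts the NumPy floats to plain Python floats
avg_data = data.map(lambda r: (r[0], tuple(np.mean(r[1], axis=0).tolist())))
avg_data.collect()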
# [('d1', (5.093333333333334, 2.2106666666666666)),
# ('d2', (4.771333333333334, 1.9026666666666667)),
# ('d3', (5.023666666666666, 1.8423333333333334))]
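If you would rather avoid the NumPy dependency and want the two-decimal output from the question, a plain-Python helper is one option. This is only a sketch: `mean_point` and its `ndigits` parameter are names made up for illustration, and the list comprehension stands in for the RDD so the snippet runs without Spark. On a real RDD (assumed here to be called `rdd`), the same helper should work with `rdd.mapValues(mean_point).collect()`.

```python
def mean_point(points, ndigits=2):
    """Average the x and y coordinates separately over a list of (x, y) tuples.

    Hypothetical helper for illustration; rounds each mean to `ndigits` places.
    """
    xs, ys = zip(*points)  # split the list of pairs into x values and y values
    return (round(sum(xs) / len(xs), ndigits),
            round(sum(ys) / len(ys), ndigits))


data = [('d1', [(4.832, 1.963), (5.439, 2.147), (5.009, 2.522)]),
        ('d2', [(4.26, 2.033), (5.24, 1.642), (4.814, 2.033)]),
        ('d3', [(4.646, 1.827), (5.137, 1.858), (5.288, 1.842)])]

# Local stand-in for rdd.mapValues(mean_point).collect()
result = [(key, mean_point(points)) for key, points in data]
print(result)
# [('d1', (5.09, 2.21)), ('d2', (4.77, 1.9)), ('d3', (5.02, 1.84))]
```

mapValues leaves the keys untouched and only transforms the values, which fits this case since each key already holds its full list of points.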