I have a bunch of tuples that look like the following (this is the RDD filtered that I refer to later):
(10, 20)
(20, 40)
(30, 60)
I want to sum these tuples element-wise and produce a single value: (10 + 20 + 30) / (20 + 40 + 60) = 0.5. I am trying to write lambda functions in PySpark to accomplish this, but I am not mapping them correctly. I currently have
totals = filtered.map(x, y: sum(x), sum(y))
but this doesn't seem to be working out so well. Is there something else I should be doing to accomplish this?
Your RDD (filtered)
(10, 20)
(20, 40)
(30, 60)
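If you want to reproduce this locally, the sample RDD can be built like so (a minimal sketch; the local SparkSession setup here is my assumption, not taken from your code):

from pyspark.sql import SparkSession

# Hypothetical local session, only needed to test this snippet
spark = SparkSession.builder.master("local[*]").appName("tuple-sums").getOrCreate()
filtered = spark.sparkContext.parallelize([(10, 20), (20, 40), (30, 60)])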
Note: map()
iterates over each row, so fetch each column's value from the row into a variable and perform the operation on top of it.
col_a_sum = filtered.map(lambda row: row[0]).sum()  # 10 + 20 + 30 = 60
col_b_sum = filtered.map(lambda row: row[1]).sum()  # 20 + 40 + 60 = 120
total = col_a_sum / col_b_sum  # 60 / 120
print(total)
Output:
0.5
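If you would rather scan the data once instead of running two separate jobs, a single reduce that sums the tuples element-wise gives the same result (a sketch, assuming filtered is an RDD of two-element numeric tuples):

# One pass: add the tuples element-wise, then divide the totals
a_sum, b_sum = filtered.reduce(lambda t1, t2: (t1[0] + t2[0], t1[1] + t2[1]))
print(a_sum / b_sum)  # 0.5

This triggers one Spark job instead of two, which mainly matters on large inputs.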