Tags: python-3.x, pyspark, lambda

Lambda Function to Sum Tuple Values


I have a bunch of tuples that look like the following (this is filtered which I refer to later):

(10, 20)
(20, 40)
(30, 60)

I want to sum the above tuples element-wise and produce one value: (10 + 20 + 30) / (20 + 40 + 60) = 0.5. I am trying to write lambda functions in pyspark to accomplish this, but I am not mapping them correctly. I currently have

totals = filtered.map(x, y: sum(x), sum(y))

but this doesn't seem to be working out so well. Is there something else I should be doing to accomplish this?


Solution

  • Your RDD (filtered)

    (10, 20)
    (20, 40)
    (30, 60)
    

Note: map() iterates over each row, so extract the column value you need from each row and perform the operation on it.

    # Sum the first element of every tuple
    col_a_sum = filtered.map(lambda row: row[0]).sum()
    # Sum the second element of every tuple
    col_b_sum = filtered.map(lambda row: row[1]).sum()
    
    total = col_a_sum / col_b_sum
    print(total)
    

    Output:

    0.5
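
As an aside, the two map()/sum() passes can be collapsed into a single pass with RDD.reduce; in PySpark that would be filtered.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])). A minimal plain-Python sketch of the same pairwise logic, using a list in place of the RDD so it runs without a Spark cluster:

```python
from functools import reduce

# Stand-in for the RDD's rows
filtered = [(10, 20), (20, 40), (30, 60)]

# Element-wise sum of all tuples: (10+20+30, 20+40+60) -> (60, 120)
sum_a, sum_b = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), filtered)

total = sum_a / sum_b
print(total)  # 0.5
```

The single reduce avoids traversing the data twice, which matters once the RDD is large enough that each pass is a full distributed job.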