Tags: python, pyspark, rdd, reduce

Reducing (K,V) pairs by key and sorting by V


I'm super new to pyspark and RDDs. Apologies if this question is very rudimentary.

I have mapped and cleaned my data using the following code:

delay = datasplit.map(lambda x: (x[33], x[8], x[9])).filter(lambda x: x[0] != u'0.00').filter(lambda x: x[0] != '')

but now I need to somehow convert into the following output:

(124, u'"OO""N908SW"')
(432, u'"DL""N810NW"')

where the first element is the sum of the x[33] values mentioned above, grouped by the combination of x[8] and x[9].

I've completed the mapping and get the output below (which is close):

lines = delay.map(lambda x: (float(x[0]), [x[1], x[2]]))

Output:

[(-10.0, [u'OO', u'N908SW']), (62.0, [u'DL', u'N810NW']), (-6.0, [u'WN', u'N7811F'])]

but I can't figure out how to reduce or combine x[1] and x[2] to create the output shown above.

Thanks in advance.


Solution

  • As a general rule of thumb, you want as few Python operations as possible.

    I reduced your code to one map and one reduce.

    import operator
    
    delay_sum = (
        datasplit
        # key: carrier code concatenated with tail number; value: delay as float (0.0 if empty)
        .map(lambda x: (x[8] + x[9], float(x[33]) if x[33] else 0.0))
        # sum the delays per key
        .reduceByKey(operator.add)
    )
    
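    The title also mentions sorting by V. A minimal follow-up sketch, assuming you want the (sum, key) shape shown in the question: swap key and value, then sort with the standard RDD sortByKey method (the sample values in the comment are just illustrative).

    # swap to (sum, key) and sort by the sum, largest first
    result = delay_sum.map(lambda kv: (kv[1], kv[0])).sortByKey(ascending=False)
    
    # result.take(2) would then look something like:
    # [(124.0, u'OON908SW'), (62.0, u'DLN810NW')]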

    And it goes without saying that these kinds of operations usually run faster when using Spark DataFrames.
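
    For illustration, a rough DataFrame equivalent might look like the sketch below. The field names (carrier, tail_num, delay) are made up for the example; substitute whatever your data actually contains.

    from pyspark.sql import Row, SparkSession
    from pyspark.sql import functions as F
    
    spark = SparkSession.builder.getOrCreate()
    
    # build a DataFrame from the cleaned RDD; field names are hypothetical
    df = delay.map(lambda x: Row(carrier=x[1], tail_num=x[2], delay=float(x[0]))).toDF()
    
    delay_df = (
        df.groupBy(F.concat("carrier", "tail_num").alias("key"))
          .agg(F.sum("delay").alias("total_delay"))
          .orderBy(F.col("total_delay").desc())
    )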