python, pyspark, rdd

How to select data with the oldest time per key in an RDD?


I have an RDD with two fields, ID and time. The time is a datetime.datetime object. Here are the first few records of the RDD:

 [[41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)],
 [32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)],
 [41186, datetime.datetime(2014, 3, 2, 0, 31, 29, 380000)],
 [40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000)],
 [4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)]]

The same ID can appear multiple times in the data with different times, and for each ID I only want to keep the record with the oldest (earliest) time.

For example, in the sample data above, I only need to select:

 [[41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)],
 [32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)],
 [40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000)],
 [4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)]]

How can I write a query to get this output? Thank you.


Solution

  • Use groupByKey to collect the times for each ID, then apply min with mapValues to keep the earliest one:

    print(rdd.groupByKey().mapValues(min).collect())
    #[(41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)),
    # (32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)),
    # (4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)),
    # (40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000))]
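
An alternative worth considering is reduceByKey(min), which combines values within each partition before the shuffle, so it usually moves less data than groupByKey. The sketch below reproduces that per-key reduction in plain Python on the sample records from the question, so the logic can be checked without a Spark session. The helper earliest_per_key is a hypothetical stand-in for illustration, not part of PySpark, and the commented last line assumes an RDD named rdd holding the same [ID, time] pairs.

```python
from datetime import datetime

# Sample records copied from the question: [ID, time] pairs.
records = [
    [41186, datetime(2014, 3, 1, 20, 48, 5, 630000)],
    [32036, datetime(2014, 3, 2, 0, 25, 41, 950000)],
    [41186, datetime(2014, 3, 2, 0, 31, 29, 380000)],
    [40479, datetime(2014, 3, 2, 0, 39, 6, 800000)],
    [4598, datetime(2014, 3, 2, 1, 48, 47, 430000)],
]


def earliest_per_key(pairs):
    """Hypothetical local stand-in for the per-key reduction done by reduceByKey(min)."""
    result = {}
    for key, value in pairs:
        # Keep the smaller of the stored time and the new one for this key.
        result[key] = min(result[key], value) if key in result else value
    return result


earliest = earliest_per_key(records)

# Assumed PySpark equivalent, given an RDD named rdd with the same records:
# rdd.reduceByKey(min).collect()
```

Because min is associative and commutative, it is safe to use as the reduce function regardless of how the records are partitioned.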