I have an RDD with two variables, ID and time. The time is in datetime.datetime format. Here is a head scan of the RDD data:
[[41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)],
[32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)],
[41186, datetime.datetime(2014, 3, 2, 0, 31, 29, 380000)],
[40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000)],
[4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)]]
One ID can appear multiple times in the data with different datetimes, and I only want to keep each ID once, with its earliest time (the one furthest back).
For example, in the sample data above, I only need to select:
[[41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)],
[32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)],
[40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000)],
[4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)]]
How can I write a query to get this output? Thank you.
Use groupByKey and apply min to each group's values:
print(rdd.groupByKey().mapValues(min).collect())
#[(41186, datetime.datetime(2014, 3, 1, 20, 48, 5, 630000)),
# (32036, datetime.datetime(2014, 3, 2, 0, 25, 41, 950000)),
# (4598, datetime.datetime(2014, 3, 2, 1, 48, 47, 430000)),
# (40479, datetime.datetime(2014, 3, 2, 0, 39, 6, 800000))]
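If the RDD is large, rdd.reduceByKey(min) should give the same pairs while keeping only one running minimum per ID instead of collecting every time for a key first. The sketch below shows that per-key reduction in plain Python, without Spark, so the logic can be checked locally. The function name earliest_per_id and the records list are made up for illustration; the data is the sample from the question.

```python
from datetime import datetime


def earliest_per_id(records):
    """Return {id: earliest time}, keeping one running minimum per ID.

    Hypothetical helper: it mirrors what a pairwise min reduction per key
    does, but runs on an ordinary Python iterable rather than an RDD.
    """
    earliest = {}
    for record_id, ts in records:
        # Keep the timestamp only if it is the first or an earlier one.
        if record_id not in earliest or ts < earliest[record_id]:
            earliest[record_id] = ts
    return earliest


# Sample rows copied from the question.
records = [
    [41186, datetime(2014, 3, 1, 20, 48, 5, 630000)],
    [32036, datetime(2014, 3, 2, 0, 25, 41, 950000)],
    [41186, datetime(2014, 3, 2, 0, 31, 29, 380000)],
    [40479, datetime(2014, 3, 2, 0, 39, 6, 800000)],
    [4598, datetime(2014, 3, 2, 1, 48, 47, 430000)],
]

result = earliest_per_id(records)
for record_id, ts in result.items():
    print(record_id, ts)
```

Either way, the order of the returned pairs is not guaranteed, so sort the result if the order matters.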