I have a sample DataFrame as such:
df = pd.DataFrame(data=[('foo', datetime.date(2014, 10, 1)),
('foo', datetime.date(2014, 10, 2)),
('bar', datetime.date(2014, 10, 3)),
('bar', datetime.date(2014, 10, 1))],
columns=('name', 'date'))
which looks like this:
name date
0 foo 2014-10-01
1 foo 2014-10-02
2 bar 2014-10-03
3 bar 2014-10-01
I want to restrict the dataframe to just the last incident of each element in the name column, how do I do this?
I could awkwardly (at least I think it would be awkward) construct a boolean Series object to do this and pass it to the DataFrame's __getitem__
, like this:
pd[latest_name]
How do I most elegantly get the latest entry for each name
element?
A coworker just had a very similar question to this.
With a DataFrame object like this:
name date
0 foo 2014-10-01
1 foo 2014-10-02
2 bar 2014-10-03
3 bar 2014-10-01
You can sort by the date and then drop the duplicates, keeping the last ones like this:
last = df.sort(columns=('date',)).drop_duplicates(cols=('name',), take_last=True)
# note cols is deprecated in more recent versions of pandas,
# and you should use subset='name' if available to you
and last
is now:
name date
1 foo 2014-10-02
2 bar 2014-10-03
[2 rows x 2 columns]
But it may be preferable to set the date as the index, if we can drop the old indexes, and then just sort by the index:
df = df.set_index('date')
df = df.sort_index() # inplace=True is deprecated, so must assign
df
now returns:
name
date
2014-10-01 foo
2014-10-01 bar
2014-10-02 foo
2014-10-03 bar
Now to just take the last elements:
last_elements_frame = df.drop_duplicates(take_last=True)
and last_elements_frame
is now:
name
date
2014-10-02 foo
2014-10-03 bar