Search code examples
pythonpandaswhere-clausedataframerestrict

Get the latest of each element of a Pandas DataFrame, with range indexing and a date column?


I have a sample DataFrame as such:

df = pd.DataFrame(data=[('foo', datetime.date(2014, 10, 1)), 
                        ('foo', datetime.date(2014, 10, 2)), 
                        ('bar', datetime.date(2014, 10, 3)), 
                        ('bar', datetime.date(2014, 10, 1))], 
                  columns=('name', 'date'))

which looks like this:

  name        date
0  foo  2014-10-01
1  foo  2014-10-02
2  bar  2014-10-03
3  bar  2014-10-01

I want to restrict the dataframe to just the last incident of each element in the name column, how do I do this?

I could awkwardly (at least I think it would be awkward) construct a boolean Series object to do this and pass it to the DataFrame's __getitem__, like this:

pd[latest_name]

How do I most elegantly get the latest entry for each name element?


Solution

  • A coworker just had a very similar question to this.

    With a DataFrame object like this:

      name        date
    0  foo  2014-10-01
    1  foo  2014-10-02
    2  bar  2014-10-03
    3  bar  2014-10-01
    

    You can sort by the date and then drop the duplicates, keeping the last ones like this:

    last = df.sort(columns=('date',)).drop_duplicates(cols=('name',), take_last=True)
    # note cols is deprecated in more recent versions of pandas,
    # and you should use subset='name' if available to you
    

    and last is now:

      name        date
    1  foo  2014-10-02
    2  bar  2014-10-03
    
    [2 rows x 2 columns]
    

    But it may be preferable to set the date as the index, if we can drop the old indexes, and then just sort by the index:

    df = df.set_index('date')
    df = df.sort_index() # inplace=True is deprecated, so must assign
    

    df now returns:

               name
    date           
    2014-10-01  foo
    2014-10-01  bar
    2014-10-02  foo
    2014-10-03  bar
    

    Now to just take the last elements:

    last_elements_frame = df.drop_duplicates(take_last=True)
    

    and last_elements_frame is now:

               name
    date           
    2014-10-02  foo
    2014-10-03  bar