Search code examples
pythonpandasmulti-indexdifference

Identify rows in a dataframe that are identical except for the date index value from a multi-index?


I'm taking periodic/daily snapshots of data. Each element has a unique id number, but the element's data could stay the same or change from one day to the next. I'd like to concatenate the daily snapshots, but remove rows were everything is the same from one day to the next (except for the date).

I've looked into groupby(), diff(), and ne(). I'm guessing that the solution might be some combination, but I'm not sure. Also, there's the question of comparing the dataframes before the concatenation to limit what gets concatenated or to concatenate first and then trim the resulting dataframe. I'm working with the assumption that it's easier to concatenate first and then remove rows that don't meet the given criteria.

Here's some example data.

import pandas as pd

d1 = {'id': [1, 2, 3, 4], 'b': ['abc', 'bcd', 'cde', 'def'], 'c': ['foo', 'foo', 'bar', 'bar'], 'date': ['20190909', '20190909', '20190909','20190909']}
d1['date'] = pd.to_datetime(d1['date'])
df1 = pd.DataFrame(d1)
df1.set_index(['id', 'date'], inplace=True)

d2 = {'id': [2, 3, 4, 5], 'b': ['bcd', 'cde', 'xyz', 'xxx'], 'c': ['foo', 'foo', 'bar', 'bar'], 'date': ['20190908', '20190908', '20190908','20190908']}
d2['date'] = pd.to_datetime(d2['date'])
df2 = pd.DataFrame(d2)
df2.set_index(['id', 'date'], inplace=True)

If you concatenate first, you end up with the following.

df3 = pd.concat([df1, df2])
df3
                 b    c
id date                
1  2019-09-09  abc  foo
2  2019-09-09  bcd  foo
3  2019-09-09  cde  bar
4  2019-09-09  def  bar
2  2019-09-08  bcd  foo
3  2019-09-08  cde  foo
4  2019-09-08  xyz  bar
5  2019-09-08  xxx  bar

In this example, id == 2 is the only row that is the same in all respects except for the date. It appears twice, but I only want to keep it's oldest appearance date == '2019-09-08' and remove the newer instance date == '2019-09-09'.

I don't know how to get here programmatically, but I'd like to end up with a dataframe that looks like this.

df3.drop([(2, '2019-09-09')])
                 b    c
id date                
1  2019-09-09  abc  foo
3  2019-09-09  cde  bar
4  2019-09-09  def  bar
2  2019-09-08  bcd  foo
3  2019-09-08  cde  foo
4  2019-09-08  xyz  bar
5  2019-09-08  xxx  bar

Solution

  • You basically need .drop_duplicates(keep='last') with some constraints because of the multi-index you are using :

    # Move 'id' from the index to a new column and drop duplicated values
    result = df3.reset_index('id').drop_duplicates(keep='last') 
    
    # Reset the dataframe to the original structure
    result.set_index(['id',result.index], drop=True)
    

    Note that I'm using keep='last' here supposing that your data is ordered by descending date. You may need to sort first.