I'm taking periodic/daily snapshots of data. Each element has a unique id number, but the element's data could stay the same or change from one day to the next. I'd like to concatenate the daily snapshots, but remove rows where everything is the same from one day to the next (except for the date).
I've looked into groupby(), diff(), and ne(). I'm guessing that the solution might be some combination of them, but I'm not sure. There's also the question of whether to compare the dataframes before concatenation to limit what gets concatenated, or to concatenate first and then trim the resulting dataframe. I'm working on the assumption that it's easier to concatenate first and then remove the rows that don't meet the given criteria.
Here's some example data.
import pandas as pd
d1 = {'id': [1, 2, 3, 4], 'b': ['abc', 'bcd', 'cde', 'def'], 'c': ['foo', 'foo', 'bar', 'bar'], 'date': ['20190909', '20190909', '20190909','20190909']}
d1['date'] = pd.to_datetime(d1['date'])
df1 = pd.DataFrame(d1)
df1.set_index(['id', 'date'], inplace=True)
d2 = {'id': [2, 3, 4, 5], 'b': ['bcd', 'cde', 'xyz', 'xxx'], 'c': ['foo', 'foo', 'bar', 'bar'], 'date': ['20190908', '20190908', '20190908','20190908']}
d2['date'] = pd.to_datetime(d2['date'])
df2 = pd.DataFrame(d2)
df2.set_index(['id', 'date'], inplace=True)
If you concatenate first, you end up with the following.
df3 = pd.concat([df1, df2])
df3
b c
id date
1 2019-09-09 abc foo
2 2019-09-09 bcd foo
3 2019-09-09 cde bar
4 2019-09-09 def bar
2 2019-09-08 bcd foo
3 2019-09-08 cde foo
4 2019-09-08 xyz bar
5 2019-09-08 xxx bar
In this example, the row with id == 2 is the only one that is the same in all respects except for the date. It appears twice, but I only want to keep its oldest appearance (date == '2019-09-08') and remove the newer instance (date == '2019-09-09').
I don't know how to get here programmatically, but I'd like to end up with a dataframe that looks like this.
df3.drop([(2, '2019-09-09')])
b c
id date
1 2019-09-09 abc foo
3 2019-09-09 cde bar
4 2019-09-09 def bar
2 2019-09-08 bcd foo
3 2019-09-08 cde foo
4 2019-09-08 xyz bar
5 2019-09-08 xxx bar
You basically need .drop_duplicates(keep='last'), with some constraints because of the multi-index you are using:
# Move 'id' from the index to a new column and drop fully duplicated rows
result = df3.reset_index('id').drop_duplicates(keep='last')
# Restore the original (id, date) multi-index
result = result.set_index(['id', result.index], drop=True)
Note that I'm using keep='last' here, assuming that your data is ordered by descending date. You may need to sort first.
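Putting it all together, here's a sketch that sorts explicitly by date so the keep='last' assumption holds regardless of input order. It rebuilds the example data from the question; the subset=['id', 'b', 'c'] argument is one way to spell "everything except the date" for this particular example:

```python
import pandas as pd

# Rebuild the two daily snapshots from the question
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'b': ['abc', 'bcd', 'cde', 'def'],
                    'c': ['foo', 'foo', 'bar', 'bar'],
                    'date': pd.to_datetime(['20190909'] * 4)}).set_index(['id', 'date'])
df2 = pd.DataFrame({'id': [2, 3, 4, 5],
                    'b': ['bcd', 'cde', 'xyz', 'xxx'],
                    'c': ['foo', 'foo', 'bar', 'bar'],
                    'date': pd.to_datetime(['20190908'] * 4)}).set_index(['id', 'date'])
df3 = pd.concat([df1, df2])

# Flatten the index, sort by date descending so the oldest copy of each
# duplicated row comes last, then drop rows whose (id, b, c) already
# appeared, keeping that oldest occurrence.
result = (df3.reset_index()
             .sort_values('date', ascending=False)
             .drop_duplicates(subset=['id', 'b', 'c'], keep='last')
             .set_index(['id', 'date']))
```

This leaves seven rows: the newer duplicate (2, 2019-09-09) is gone while (2, 2019-09-08) survives, matching the desired output in the question.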