So im having this challenge right now. I have a pandas df(with over 30 million rows) that looks like follows:
Name | State | Date | Income
____________________________________
Name1 | Tx | 2019-01-26 | $100
Name2 | CA | 2019-01-26 | $500
Name3 | LA | 2019-01-26 | $200
Name1 | Tx | 2019-02-26 | $110
Name2 | CA | 2019-02-26 | $520
Name3 | LA | 2019-02-26 | $210
As you can see, i have 2 columns to identify a person (his name and State), then 1 column for a timestamp (which varies over month), what I'm trying to achieve is to get the difference between months for every person, resulting in something like:
Name | State | Date | Income | Variation
____________________________________
Name1 | Tx | 2019-01-26 | $100 | NaN
Name2 | CA | 2019-01-26 | $500 | NaN
Name3 | LA | 2019-01-26 | $200 | NaN
Name1 | Tx | 2019-02-26 | $110 | 10
Name2 | CA | 2019-02-26 | $480 | -20
Name3 | LA | 2019-02-26 | $210 | 10
And so on for the incoming months I found the diff function, but i don't know how to specify that must match the same name and state but for the previous month
You should sort by Date
, and use a groupby transform on ['Name', 'State']
. This solution should work:
df['Variation'] = (df.sort_values('Date').groupby(['Name', 'State'])['Income']
.transform(lambda x: x.diff()))