Search code examples
pythonpandasdifference

Pandas difference between row by multiple conditions


So im having this challenge right now. I have a pandas df(with over 30 million rows) that looks like follows:

Name  | State | Date       | Income
____________________________________
Name1 | Tx    | 2019-01-26 | $100
Name2 | CA    | 2019-01-26 | $500
Name3 | LA    | 2019-01-26 | $200
Name1 | Tx    | 2019-02-26 | $110
Name2 | CA    | 2019-02-26 | $520
Name3 | LA    | 2019-02-26 | $210

As you can see, i have 2 columns to identify a person (his name and State), then 1 column for a timestamp (which varies over month), what I'm trying to achieve is to get the difference between months for every person, resulting in something like:

Name  | State | Date       | Income | Variation
____________________________________
Name1 | Tx    | 2019-01-26 | $100   |  NaN
Name2 | CA    | 2019-01-26 | $500   |  NaN
Name3 | LA    | 2019-01-26 | $200   |  NaN
Name1 | Tx    | 2019-02-26 | $110   |  10
Name2 | CA    | 2019-02-26 | $480   | -20
Name3 | LA    | 2019-02-26 | $210   |  10

And so on for the incoming months I found the diff function, but i don't know how to specify that must match the same name and state but for the previous month


Solution

  • You should sort by Date, and use a groupby transform on ['Name', 'State']. This solution should work:

    df['Variation'] = (df.sort_values('Date').groupby(['Name', 'State'])['Income']
                       .transform(lambda x: x.diff()))