Search code examples
pythonpandasdataframetime-seriescumsum

Pandas time series cumsum of previous records matching a column of current row


I'm trying to add a column that represents the cumulative sum of the previous records that match the current value of the row for a given column, let's take an example.

My data is :

|t         |label|qty|cumsum|
|2020-01-05|A    |10 |10    |
|2020-01-06|B    |5  |5     |
|2020-01-06|C    |2  |2     |
|2020-01-07|A    |-5 |5     |
|2020-01-07|A    |2  |7     |
|2020-01-08|B    |-30|-25   |
|2020-01-09|C    |5  |7     |

The cumsum column would be summarized as "do the cumulative sum of all past records where label is the same as the current row"

Ideally for code clarity, I'd like an expression such as df['cumsum']=df.loc[(df['label']==current label) & (df['t'] <= current index)].cumsum() that doesn't go explicitly through looping each previous columns in a for ix,row in df:

Another aspect is that the cumulative sum is done from the start, may the dataframe be 20 records long or 500k records long.

Also, for clarity I just put 3 labels in my example but I can have dozens of labels in my dataset

I found some other similar issues but they don't exactly suit my need :

Any kind of help is welcome ! Thanks !


Solution

  • df.groupby('label')['qty'].transform('cumsum')