Search code examples
pythonsqlpandasanalyticswindow-functions

Analytic Sliding Windows function in Python Pandas


Have table:

list_1= [['2016-01-01',1,'King', 1000],    
        ['2016-01-02',1,'King', -200],    
        ['2016-01-03',1,'King', 100],    
        ['2016-01-04',1,'King',-400],    
        ['2016-01-05',1,'King', 200],    
        ['2016-01-06',1,'King',  -200],    
        ['2016-01-01',2,'Smith',  1000],    
        ['2016-01-02',2,'Smith',  -300],    
        ['2016-01-03',2,'Smith',  -600],    
        ['2016-01-04',2,'Smith',  100],    
        ['2016-01-05',2,'Smith',  -100]]
labels=['a_date','c_id','c_name','c_action']
df=pd.DataFrame(list_1,columns=labels)
df

OUT:

    a_date       c_id   c_name  c_action
0   2016-01-01     1    King    1000
1   2016-01-02     1    King    -200
2   2016-01-03     1    King    100
3   2016-01-04     1    King    -400
4   2016-01-05     1    King    200
5   2016-01-06     1    King    -200
6   2016-01-01     2    Smith   1000
7   2016-01-02     2    Smith   -300
8   2016-01-03     2    Smith   -600
9   2016-01-04     2    Smith   100
10  2016-01-05     2    Smith   -100

Need to get table:

a_date      c_id    c_name  c_amount    Balance
2016-01-01     1    King    1000        1000
2016-01-02     1    King    -200        800
2016-01-03     1    King    100         900
2016-01-04     1    King    -400        500
2016-01-05     1    King    200         700
2016-01-06     1    King    -200        500
2016-01-01     2    Smith   1000        1000
2016-01-02     2    Smith   -300        700
2016-01-03     2    Smith   -600        100
2016-01-04     2    Smith   100         200
2016-01-05     2    Smith   -100        100

So i need make "Balance" column with cumulative amount after each action for each customer. This equivalent for SQL query :

SELECT *,
        SUM(c_amount) OVER (PARTITION BY c_id ORDER BY a_date) AS 'Balance'
FROM account_actions

For both customers the solution is not difficult, can divide table by c_id, summarize and consolidate back.But it should be a dynamic solution for 10000 customers ...


Solution

  • As @Vaishali commented, this is groupby and cumsum. You may want to do sort_values to make sure that the data is sorted in order, although it appears already so:

    # sort by `c_id` and `a_date`
    df = df.sort_values(['c_id','a_date'])
    
    df['balance'] = df.groupby('c_id')['c_action'].cumsum()
    

    Output:

            a_date  c_id c_name  c_action  balance
    0   2016-01-01     1   King      1000     1000
    1   2016-01-02     1   King      -200      800
    2   2016-01-03     1   King       100      900
    3   2016-01-04     1   King      -400      500
    4   2016-01-05     1   King       200      700
    5   2016-01-06     1   King      -200      500
    6   2016-01-01     2  Smith      1000     1000
    7   2016-01-02     2  Smith      -300      700
    8   2016-01-03     2  Smith      -600      100
    9   2016-01-04     2  Smith       100      200
    10  2016-01-05     2  Smith      -100      100