Search code examples
pandasdatepandas-groupbyshiftcumsum

How to sum values under GroupBy and consecutive date conditions?


Given table:

ID LINE SITE DATE UNITS TOTAL
1 X AAA 02-May-2017 12 30
2 X AAA 03-May-2017 10 22
3 X AAA 04-May-2017 22 40
4 Z AAA 20-MAY-2017 15 44
5 Z AAA 21-May-2017 8 30
6 Z BBB 22-May-2017 10 32
7 Z BBB 23-May-2017 25 52
8 K CCC 02-Jun-2017 6 22
9 K CCC 03-Jun-2017 4 33
10 K CCC 12-Aug-2017 11 44
11 K CCC 13-Aug-2017 19 40
12 K CCC 14-Aug-2017 30 40

for each row if ID,LINE ,SITE equal to previous row (day) need to calculate as below (last day) and (last 3 days ) : Note that is need to insure date are consecutive under "groupby" of ID,LINE ,SITE columns

ID LINE SITE DATE UNITS TOTAL Last day Last 3 days
1 X AAA 02-May-2017 12 30 0 0
2 X AAA 03-May-2017 10 22 12/30 12/30
3 X AAA 04-May-2017 22 40 10/22 (10+12)/(30+22)
4 Z AAA 20-MAY-2017 15 44 0 0
5 Z AAA 21-May-2017 8 30 15/44 15/44
6 Z BBB 22-May-2017 10 32 0 0
7 Z BBB 23-May-2017 25 52 10/32 10/32
8 K CCC 02-Jun-2017 6 22 0 0
9 K CCC 03-Jun-2017 4 33 6/22 6/22
10 K CCC 12-Aug-2017 11 44 4/33 0
11 K CCC 13-Aug-2017 19 40 11/44 (11/44)
12 K CCC 14-Aug-2017 30 40 19/40 (11+19/44+40)

Solution

  • In this cases i usually do a for loop with groupby:

    import pandas as pd
    import numpy as np
    
    #copied your table
    table = pd.read_csv('/home/fm/Desktop/stackover.csv')
    table.set_index('ID', inplace = True)
    table[['Last day','Last 3 days']] = np.nan
    
    for i,r in table.groupby(['LINE' ,'SITE']):
        #First subset non sequential dates
        limits_interval = pd.to_datetime(r['DATE']).diff() != '1 days'
        #First element is a false positive, as its impossible to calculate past days from first day
        limits_interval.iloc[0]=False
    
        ids_subset = r.index[limits_interval].to_list()
        ids_subset.append(r.index[-1]+1) #to consider all values
        id_start = 0
    
        for id_end in ids_subset:    
            r_sub = r.loc[id_start:id_end-1, :].copy()
            id_start = id_end 
    
            #move all values one day off, if the database is as in your example (1 line per day) wont have problems
            r_shifted = r_sub.shift(1)
    
            r_sub['Last day']=r_shifted['UNITS']/r_shifted['TOTAL']
    
            aux_units_cumsum = r_shifted['UNITS'].cumsum()
            aux_total_cumsum = r_shifted['TOTAL'].cumsum()
    
            r_sub['Last 3 days'] = aux_units_cumsum/aux_total_cumsum
    
            r_sub.fillna(0, inplace = True)
    
            table.loc[r_sub.index,:]=r_sub.copy()
    

    You can make a function and apply in groupby, it would be cleaner: Apply function to pandas groupby. It would be more elegant. Wish I could help you, good luck