Search code examples
pandasdategroup-by

group by with two operation


I have this dataframe:

date,qq,q_t
1956-01-01,  1,   4
1956-01-02,  2,   5
1956-01-03,  3,   1
1956-01-04,  4,   1
1956-01-05,  5,   1
1956-01-06,  6,   10
1956-01-07,  7,   11
1956-01-08,  8,   12
1956-01-09,  9,   5
1956-01-10,  10,  3
1956-01-11,  11,  3
1956-01-12,  12,  3
1956-01-13,  13,  50
1956-01-14,  14,  51
1956-01-15,  15,  52
1956-01-16,  16,  53
1956-01-17,  17,  1
1956-01-18,  18,  23
1956-01-19,  19,  1

As a first step I compute the difference between 'qq' and 'qt'. Then I compute the number of consecutive days when the difference between 'qq' and 'qt' is positive as:

dfr   = pd.read_csv('test.csv', sep=',',index_col=0,parse_dates=True)


dfr['diff'] = dfr['q_t'] - dfr['qq']

dfr['test'] = np.where(dfr['diff']>=0, True, False)



dfr['diff'] = np.where(dfr['diff']<0 , np.nan, dfr['diff'])

dfr['group'] = (dfr['test'] != dfr['test'].shift()).cumsum()

dfr.reset_index(inplace=True)

true_values = dfr[dfr['test']]

consecutive_days = true_values.groupby('group')['date'].agg(['min', 'max', 'count'])

As it could be noticed, these are the resulting data-frames:

dfr as:

    date    qq  q_t diff    test    group
0   1956-01-01  1   4   3.0 True    1
1   1956-01-02  2   5   3.0 True    1
2   1956-01-03  3   1       False   2
3   1956-01-04  4   1       False   2
4   1956-01-05  5   1       False   2
5   1956-01-06  6   10  4.0 True    3
6   1956-01-07  7   11  4.0 True    3
7   1956-01-08  8   12  4.0 True    3
8   1956-01-09  9   5       False   4
9   1956-01-10  10  3       False   4
10  1956-01-11  11  3       False   4
11  1956-01-12  12  3       False   4
12  1956-01-13  13  50  37.0    True    5
13  1956-01-14  14  51  37.0    True    5
14  1956-01-15  15  52  37.0    True    5
15  1956-01-16  16  53  37.0    True    5
16  1956-01-17  17  1       False   6
17  1956-01-18  18  23  5.0 True    7
18  1956-01-19  19  1       False   8

and consecutive_days as

group   min max count
1   1956-01-01  1956-01-02  2
3   1956-01-06  1956-01-08  3
5   1956-01-13  1956-01-16  4
7   1956-01-18  1956-01-18  1

I would like to have another column with the sum of all the True value with that belongs to the same group. In other words. i would like get the following result:

group   min max count  sum
1   1956-01-01  1956-01-02  2    6
3   1956-01-06  1956-01-08  3   12
5   1956-01-13  1956-01-16  4  148
7   1956-01-18  1956-01-18  1    5

My original idea was to work on dfr as

vol = dfr.groupby(by=['group']).sum(min_count=1)

This just before resetting the index. After that I could remove the lines with nan.

However, I would like to take advance of the first "group by". Is it possible to use two operation in the same "group by"?


Solution

  • You can do multiple aggregations in the first groupby operation

    consecutive_days = (
        true_values
        .groupby('group')
        .agg({'date': ['min', 'max', 'count'], 'diff': 'sum'})
        .droplevel(0, axis=1)
    )
    

                  min         max  count    sum
    group                                      
    1      1956-01-01  1956-01-02      2    6.0
    3      1956-01-06  1956-01-08      3   12.0
    5      1956-01-13  1956-01-16      4  148.0
    7      1956-01-18  1956-01-18      1    5.0