Group by with two operations

I have this dataframe:

date,qq,q_t
1956-01-01,  1,   4
1956-01-02,  2,   5
1956-01-03,  3,   1
1956-01-04,  4,   1
1956-01-05,  5,   1
1956-01-06,  6,   10
1956-01-07,  7,   11
1956-01-08,  8,   12
1956-01-09,  9,   5
1956-01-10,  10,  3
1956-01-11,  11,  3
1956-01-12,  12,  3
1956-01-13,  13,  50
1956-01-14,  14,  51
1956-01-15,  15,  52
1956-01-16,  16,  53
1956-01-17,  17,  1
1956-01-18,  18,  23
1956-01-19,  19,  1

As a first step I compute the difference between 'q_t' and 'qq'. Then I count the consecutive days on which that difference is non-negative:

import numpy as np
import pandas as pd

dfr = pd.read_csv('test.csv', sep=',', index_col=0, parse_dates=True)

# difference between q_t and qq; flag the days where it is non-negative
dfr['diff'] = dfr['q_t'] - dfr['qq']
dfr['test'] = dfr['diff'] >= 0

# replace the negative differences with NaN
dfr['diff'] = np.where(dfr['diff'] < 0, np.nan, dfr['diff'])

# a new group id starts every time 'test' changes value
dfr['group'] = (dfr['test'] != dfr['test'].shift()).cumsum()

dfr.reset_index(inplace=True)

true_values = dfr[dfr['test']]
consecutive_days = true_values.groupby('group')['date'].agg(['min', 'max', 'count'])

These are the resulting dataframes:

dfr as:

          date  qq  q_t  diff   test  group
0   1956-01-01   1    4   3.0   True      1
1   1956-01-02   2    5   3.0   True      1
2   1956-01-03   3    1   NaN  False      2
3   1956-01-04   4    1   NaN  False      2
4   1956-01-05   5    1   NaN  False      2
5   1956-01-06   6   10   4.0   True      3
6   1956-01-07   7   11   4.0   True      3
7   1956-01-08   8   12   4.0   True      3
8   1956-01-09   9    5   NaN  False      4
9   1956-01-10  10    3   NaN  False      4
10  1956-01-11  11    3   NaN  False      4
11  1956-01-12  12    3   NaN  False      4
12  1956-01-13  13   50  37.0   True      5
13  1956-01-14  14   51  37.0   True      5
14  1956-01-15  15   52  37.0   True      5
15  1956-01-16  16   53  37.0   True      5
16  1956-01-17  17    1   NaN  False      6
17  1956-01-18  18   23   5.0   True      7
18  1956-01-19  19    1   NaN  False      8

and consecutive_days as

group   min max count
1   1956-01-01  1956-01-02  2
3   1956-01-06  1956-01-08  3
5   1956-01-13  1956-01-16  4
7   1956-01-18  1956-01-18  1

I would like to add another column with the sum of 'diff' over all the True rows that belong to the same group. In other words, I would like to get the following result:

group   min max count  sum
1   1956-01-01  1956-01-02  2    6
3   1956-01-06  1956-01-08  3   12
5   1956-01-13  1956-01-16  4  148
7   1956-01-18  1956-01-18  1    5

My original idea was to work on dfr as

vol = dfr.groupby(by=['group']).sum(min_count=1)

I would run this just before resetting the index, and afterwards drop the rows containing NaN.

However, I would like to take advantage of the first "group by". Is it possible to apply two operations in the same "group by"?

Solution

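You can do multiple aggregations in the first groupby operation by passing agg a dict that maps each column to the function(s) to apply to it. This produces two-level column labels, so droplevel(0, axis=1) removes the top level (the original column names) and keeps only the function names: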

consecutive_days = (
    true_values
    .groupby('group')
    .agg({'date': ['min', 'max', 'count'], 'diff': 'sum'})
    .droplevel(0, axis=1)
)

              min         max  count    sum
group                                      
1      1956-01-01  1956-01-02      2    6.0
3      1956-01-06  1956-01-08      3   12.0
5      1956-01-13  1956-01-16      4  148.0
7      1956-01-18  1956-01-18      1    5.0
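
As an alternative, the same result can probably be obtained with named aggregation, which gives flat column names directly and so needs no droplevel. The following is only a sketch: it reads the sample data from an inline string instead of test.csv, skips the NaN replacement (the False rows are filtered out before grouping anyway), and the output column names first_day, last_day, n_days and diff_sum are made up for the example.

```python
import io

import pandas as pd

# Inline copy of the sample data from the question (stand-in for test.csv)
csv_text = """date,qq,q_t
1956-01-01,1,4
1956-01-02,2,5
1956-01-03,3,1
1956-01-04,4,1
1956-01-05,5,1
1956-01-06,6,10
1956-01-07,7,11
1956-01-08,8,12
1956-01-09,9,5
1956-01-10,10,3
1956-01-11,11,3
1956-01-12,12,3
1956-01-13,13,50
1956-01-14,14,51
1956-01-15,15,52
1956-01-16,16,53
1956-01-17,17,1
1956-01-18,18,23
1956-01-19,19,1
"""

dfr = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

dfr["diff"] = dfr["q_t"] - dfr["qq"]
dfr["test"] = dfr["diff"] >= 0

# A new group id starts every time 'test' changes value
dfr["group"] = (dfr["test"] != dfr["test"].shift()).cumsum()

# Named aggregation: output_name=(input_column, function).
# The output names below are hypothetical, chosen for this example.
consecutive_days = (
    dfr[dfr["test"]]
    .groupby("group")
    .agg(
        first_day=("date", "min"),
        last_day=("date", "max"),
        n_days=("date", "count"),
        diff_sum=("diff", "sum"),
    )
)

print(consecutive_days)
```

Because the negative differences are never replaced with NaN here, 'diff' stays an integer column and the sums come out as integers rather than floats.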