I have this dataframe:
date,qq,q_t
1956-01-01, 1, 4
1956-01-02, 2, 5
1956-01-03, 3, 1
1956-01-04, 4, 1
1956-01-05, 5, 1
1956-01-06, 6, 10
1956-01-07, 7, 11
1956-01-08, 8, 12
1956-01-09, 9, 5
1956-01-10, 10, 3
1956-01-11, 11, 3
1956-01-12, 12, 3
1956-01-13, 13, 50
1956-01-14, 14, 51
1956-01-15, 15, 52
1956-01-16, 16, 53
1956-01-17, 17, 1
1956-01-18, 18, 23
1956-01-19, 19, 1
As a first step I compute the difference between 'qq' and 'qt'. Then I compute the number of consecutive days when the difference between 'qq' and 'qt' is positive as:
dfr = pd.read_csv('test.csv', sep=',',index_col=0,parse_dates=True)
dfr['diff'] = dfr['q_t'] - dfr['qq']
dfr['test'] = np.where(dfr['diff']>=0, True, False)
dfr['diff'] = np.where(dfr['diff']<0 , np.nan, dfr['diff'])
dfr['group'] = (dfr['test'] != dfr['test'].shift()).cumsum()
dfr.reset_index(inplace=True)
true_values = dfr[dfr['test']]
consecutive_days = true_values.groupby('group')['date'].agg(['min', 'max', 'count'])
As it could be noticed, these are the resulting data-frames:
dfr as:
date qq q_t diff test group
0 1956-01-01 1 4 3.0 True 1
1 1956-01-02 2 5 3.0 True 1
2 1956-01-03 3 1 False 2
3 1956-01-04 4 1 False 2
4 1956-01-05 5 1 False 2
5 1956-01-06 6 10 4.0 True 3
6 1956-01-07 7 11 4.0 True 3
7 1956-01-08 8 12 4.0 True 3
8 1956-01-09 9 5 False 4
9 1956-01-10 10 3 False 4
10 1956-01-11 11 3 False 4
11 1956-01-12 12 3 False 4
12 1956-01-13 13 50 37.0 True 5
13 1956-01-14 14 51 37.0 True 5
14 1956-01-15 15 52 37.0 True 5
15 1956-01-16 16 53 37.0 True 5
16 1956-01-17 17 1 False 6
17 1956-01-18 18 23 5.0 True 7
18 1956-01-19 19 1 False 8
and consecutive_days as
group min max count
1 1956-01-01 1956-01-02 2
3 1956-01-06 1956-01-08 3
5 1956-01-13 1956-01-16 4
7 1956-01-18 1956-01-18 1
I would like to have another column with the sum of all the True value with that belongs to the same group. In other words. i would like get the following result:
group min max count sum
1 1956-01-01 1956-01-02 2 6
3 1956-01-06 1956-01-08 3 12
5 1956-01-13 1956-01-16 4 148
7 1956-01-18 1956-01-18 1 5
My original idea was to work on dfr as
vol = dfr.groupby(by=['group']).sum(min_count=1)
This just before resetting the index. After that I could remove the lines with nan.
However, I would like to take advance of the first "group by". Is it possible to use two operation in the same "group by"?
You can do multiple aggregations in the first groupby operation
consecutive_days = (
true_values
.groupby('group')
.agg({'date': ['min', 'max', 'count'], 'diff': 'sum'})
.droplevel(0, axis=1)
)
min max count sum
group
1 1956-01-01 1956-01-02 2 6.0
3 1956-01-06 1956-01-08 3 12.0
5 1956-01-13 1956-01-16 4 148.0
7 1956-01-18 1956-01-18 1 5.0