I have a DataFrame which have a column with statuses like this:
datetime | session | try | status
2020-09-17 10:00:01 | '1a' | '1a_1' | 'success'
2020-09-17 10:00:02 | '2a' | '2a_1' | 'fail'
2020-09-17 10:00:03 | '2a' | '2a_2' | 'success'
2020-09-17 10:00:03 | '3a' | '3a_1' | 'interrupted'
2020-09-18 13:00:04 | '4a' | '4a_1' | 'fail'
I want to resample data by day with counting status types by condition in sessions (not tries).
I can resample it by tries easyly like this:
df['date'] = df['datetime'].dt.date
df['ones'] = np.ones(df.shape[0])
piv = df.pivot_table(index='date', columns='status', values='ones', aggfunc=len).fillna(0)
And have:
day | success | fail | interrupted
2020-09-17 | 2 | 2 | 1
2020-09-18 | 0 | 1 | 0
But I want to aggregate it by session with condition no matter how much tries in session.:
So I shoul get something like this:
day | success | fail | interrupted
2020-09-17 | 2 | 0 | 1
2020-09-18 | 0 | 1 | 0
I stuck with function and all I come up with ends with "ValueError: The truth value of a Series is ambiguous". I will be very greatfull for any ideas.
My idea is convert values of statust to ordered categories, sorting and get row only with most important value which is defined in list passed to parameter categories
:
print (df)
datetime session try status
0 2020-09-17 10:00:01 1a 1a_1 success
1 2020-09-17 10:00:02 2a 2a_1 fail
2 2020-09-17 10:00:03 2a 2a_2 success
3 2020-09-17 10:00:03 3a 3a_1 interrupted
4 2020-09-18 13:00:04 4a 4a_1 fail
5 2020-09-19 10:00:01 1a 1a_1 interrupted
6 2020-09-19 10:00:02 1a 2a_1 fail
7 2020-09-19 10:00:03 2a 2a_2 success
8 2020-09-19 10:00:03 2a 3a_1 interrupted
df['status'] = pd.Categorical(df['status'],
ordered=True,
categories=['success','interrupted','fail'])
df['date'] = df['datetime'].dt.date
df1 = df.sort_values(['date','status']).drop_duplicates(['date','session'])
print (df1)
datetime session try status date
0 2020-09-17 10:00:01 1a 1a_1 success 2020-09-17
2 2020-09-17 10:00:03 2a 2a_2 success 2020-09-17
3 2020-09-17 10:00:03 3a 3a_1 interrupted 2020-09-17
4 2020-09-18 13:00:04 4a 4a_1 fail 2020-09-18
7 2020-09-19 10:00:03 2a 2a_2 success 2020-09-19
5 2020-09-19 10:00:01 1a 1a_1 interrupted 2020-09-19
piv = pd.crosstab(df1['date'], df1['status'])
print (piv)
status success interrupted fail
date
2020-09-17 2 1 0
2020-09-18 0 0 1
2020-09-19 1 1 0