Search code examples
pythonpandaspandas-applypandas-resample

Pandas resample and aggregate with condition


I have a DataFrame which have a column with statuses like this:

datetime               |    session    |    try       |    status
2020-09-17 10:00:01    |    '1a'       |    '1a_1'    |    'success'
2020-09-17 10:00:02    |    '2a'       |    '2a_1'    |    'fail'
2020-09-17 10:00:03    |    '2a'       |    '2a_2'    |    'success'
2020-09-17 10:00:03    |    '3a'       |    '3a_1'    |    'interrupted'
2020-09-18 13:00:04    |    '4a'       |    '4a_1'    |    'fail'

I want to resample data by day with counting status types by condition in sessions (not tries).

I can resample it by tries easyly like this:

df['date'] = df['datetime'].dt.date
df['ones'] = np.ones(df.shape[0])
piv = df.pivot_table(index='date', columns='status', values='ones', aggfunc=len).fillna(0)

And have:

day           |    success    |    fail    |    interrupted
2020-09-17    |    2          |    2       |    1
2020-09-18    |    0          |    1       |    0

But I want to aggregate it by session with condition no matter how much tries in session.:

  • if 'success' status in session tries, then success +1, fail +0, interrupted +0;
  • if 'interrupted' status in session AND NO 'success' status in session, then success +0, fail +0, interrupted +1;
  • if NO 'interrupted' AND 'success' statuses in session, then then success +0, fail +1, interrupted +0.

So I shoul get something like this:

day           |    success    |    fail    |    interrupted
2020-09-17    |    2          |    0       |    1
2020-09-18    |    0          |    1       |    0

I stuck with function and all I come up with ends with "ValueError: The truth value of a Series is ambiguous". I will be very greatfull for any ideas.


Solution

  • My idea is convert values of statust to ordered categories, sorting and get row only with most important value which is defined in list passed to parameter categories:

    print (df)
                 datetime session   try       status
    0 2020-09-17 10:00:01      1a  1a_1      success
    1 2020-09-17 10:00:02      2a  2a_1         fail
    2 2020-09-17 10:00:03      2a  2a_2      success
    3 2020-09-17 10:00:03      3a  3a_1  interrupted
    4 2020-09-18 13:00:04      4a  4a_1         fail
    5 2020-09-19 10:00:01      1a  1a_1  interrupted
    6 2020-09-19 10:00:02      1a  2a_1         fail
    7 2020-09-19 10:00:03      2a  2a_2      success
    8 2020-09-19 10:00:03      2a  3a_1  interrupted
    
    df['status'] = pd.Categorical(df['status'], 
                                  ordered=True, 
                                  categories=['success','interrupted','fail'])
    df['date'] = df['datetime'].dt.date
    
    df1 = df.sort_values(['date','status']).drop_duplicates(['date','session'])
    print (df1)
                 datetime session   try       status        date
    0 2020-09-17 10:00:01      1a  1a_1      success  2020-09-17
    2 2020-09-17 10:00:03      2a  2a_2      success  2020-09-17
    3 2020-09-17 10:00:03      3a  3a_1  interrupted  2020-09-17
    4 2020-09-18 13:00:04      4a  4a_1         fail  2020-09-18
    7 2020-09-19 10:00:03      2a  2a_2      success  2020-09-19
    5 2020-09-19 10:00:01      1a  1a_1  interrupted  2020-09-19
    
    piv = pd.crosstab(df1['date'], df1['status'])
    print (piv)
    status      success  interrupted  fail
    date                                  
    2020-09-17        2            1     0
    2020-09-18        0            0     1
    2020-09-19        1            1     0