Search code examples
pythonpandasgroup-bycumulative-sumproportions

Pandas group-by proportion of cumulative sum start from 0


I have the following pandas Data Frame (without 2 the last columns):

   name        day   show-in-appointment    previous-missed-appointments   proportion-previous-missed
0  Jack   2020/01/01   show                              0                      0
1  Jack   2020/01/02   no-show                           0                      0
2  Jill   2020/01/02   no-show                           0                      0
3  Jack   2020/01/03   show                              1                      0.5
4  Jill   2020/01/03   show                              1                      1
5  Jill   2020/01/04   no-show                           1                      0.5
6  Jack   2020/01/04   show                              1                      0.33
7  Jill   2020/01/05   show                              2                      0.66
8  jack   2020/01/06   no-show                           1                      0.25
9  jack   2020/01/07   show                              2                 0.4>>>2(noshow)/5(noshow+show)
df = pd.DataFrame(
    data=np.asarray([
        ['Jack', 'Jack', 'Jill', 'Jack', 'Jill', 'Jill', 'Jack', 'Jill', 'jack', 'jack'],
        [
            '2020/01/01',
            '2020/01/02',
            '2020/01/02',
            '2020/01/03',
            '2020/01/03',
            '2020/01/04',
            '2020/01/04',
            '2020/01/05',
            '2020/01/06',
            '2020/01/07',
        ],
        ['show', 'no-show', 'no-show', 'show', 'show', 'no-show', 'show', 'show', 'no-show', 'show'],
    ]).T,
    columns=['name', 'day', 'show-in-appointment'],
)

previous-missed-appointments column is create like the code below:

df.name = df.name.str.capitalize()
df['order'] = df.index
df.day = pd.to_datetime(df.day)
df['noshow'] = df['show-in-appointment'].map({'show': 0, 'no-show': 1})
df = df.sort_values(by=['name', 'day'])
df['previous-missed-appointments'] = df.groupby('name').noshow.cumsum()
df.loc[df.noshow == 1, 'previous-missed-appointments'] -= 1
df = df.sort_values(by='order')
df = df.drop(columns=['noshow', 'order'])

********THE QUESTION IS

HOW CAN I CREATE THE LAST COLUMN ????*********


Solution

  • You can use cumsum and shift in groupby.apply for the first column, then divide by groupby.cumcount for the second column:

    # ensure dates are sorted
    df = df.sort_values(by='day', key=lambda s: pd.to_datetime(s, dayfirst=False))
    
    m = df['show-in-appointment'].eq('no-show')
    
    g = m.groupby(df['name'].str.casefold(), group_keys=False)
    df['previous-missed-appointments'] =  (
      g.apply(lambda x: x.cumsum().shift(fill_value=0))
     )
    
    df['proportion-previous-missed'] = (
        df['previous-missed-appointments'].div(g.cumcount()).fillna(0)
    )
    
    print(df)
    

    NB. it is important that the input is first sorted by date (or name/date).

    Output:

       name         day show-in-appointment  previous-missed-appointments  proportion-previous-missed
    0  Jack  2020/01/01                show                             0                    0.000000
    1  Jack  2020/01/02             no-show                             0                    0.000000
    2  Jill  2020/01/02             no-show                             0                    0.000000
    3  Jack  2020/01/03                show                             1                    0.500000
    4  Jill  2020/01/03                show                             1                    1.000000
    5  Jill  2020/01/04             no-show                             1                    0.500000
    6  Jack  2020/01/04                show                             1                    0.333333
    7  Jill  2020/01/05                show                             2                    0.666667
    8  jack  2020/01/06             no-show                             1                    0.250000
    9  jack  2020/01/07                show                             2                    0.400000
    

    Intermediates (with names grouped together and shorter column names for clarity):

       name         day     show  previous-missed computation  proportion
    0  Jack  2020/01/01     show                0         0/0        0.00
    1  Jack  2020/01/02  no-show                0         0/1        0.00
    3  Jack  2020/01/03     show                1         1/2        0.50
    6  Jack  2020/01/04     show                1         1/3        0.33
    8  Jack  2020/01/06  no-show                1         1/4        0.25
    9  Jack  2020/01/07     show                2         2/5        0.40
    2  Jill  2020/01/02  no-show                0         0/0        0.00
    4  Jill  2020/01/03     show                1         1/1        1.00
    5  Jill  2020/01/04  no-show                1         1/2        0.50
    7  Jill  2020/01/05     show                2         2/3        0.67