Search code examples
pythonpandascrosstab

Pandas count event per day from join date


I have this data frame:

name    event     join_date    created_at    
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-02
A       Y         2020-12-01   2020-12-02
B       X         2020-12-05   2020-12-05
B       X         2020-12-05   2020-12-07
C       X         2020-12-07   2020-12-08
C       X         2020-12-07   2020-12-09
...

I want to transform it into this data frame:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   0       1        1          0
...

the first rows mean that user A doing twice Event X on day_0 (first day he joins) and once on the first day and so on until day_n

For now, the result is like this:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   1       1        0          0
...

the code set the 2020-12-02 as day_0, not day_1 because there is no 2020-12-01 on A user with Y event


Solution

  • First subtract all values created_at by first value per groups by GroupBy.transform.

    Then use DataFrame.pivot_table first, add all possible datetimes by DataFrame.reindex by timedelta_range and then convert columns names by range:

    df['d'] = df['created_at'].sub(df['join_date'])
    print (df)
      name event  join_date created_at      d
    0    A     X 2020-12-01 2020-12-01 0 days
    1    A     X 2020-12-01 2020-12-01 0 days
    2    A     X 2020-12-01 2020-12-02 1 days
    3    A     Y 2020-12-01 2020-12-02 1 days
    4    B     X 2020-12-05 2020-12-05 0 days
    5    B     X 2020-12-05 2020-12-07 2 days
    6    C     X 2020-12-07 2020-12-08 1 days
    7    C     X 2020-12-07 2020-12-09 2 days
    
    df1 = (df.pivot_table(index=['name','event','join_date'], 
                         columns='d', 
                         aggfunc='size', 
                         fill_value=0)
             .reindex(pd.timedelta_range(df['d'].min(), df['d'].max()), 
                      axis=1, 
                      fill_value=0))
    df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
    df1 = df1.reset_index()
    print (df1)
      name event  join_date  day_0  day_1  day_2
    0    A     X 2020-12-01      2      1      0
    1    A     Y 2020-12-01      0      1      0
    2    B     X 2020-12-05      1      0      1
    3    C     X 2020-12-07      0      1      1