I have this data frame:
name event join_date created_at
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-02
A Y 2020-12-01 2020-12-02
B X 2020-12-05 2020-12-05
B X 2020-12-05 2020-12-07
C X 2020-12-07 2020-12-08
C X 2020-12-07 2020-12-09
...
I want to transform it into this data frame:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 0 1 1 0
...
the first rows mean that user A doing twice Event X on day_0 (first day he joins) and once on the first day and so on until day_n
For now, the result is like this:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 1 1 0 0
...
the code set the 2020-12-02 as day_0, not day_1 because there is no 2020-12-01 on A user with Y event
First subtract all values created_at
by first value per groups by GroupBy.transform
.
Then use DataFrame.pivot_table
first, add all possible datetimes by DataFrame.reindex
by timedelta_range
and then convert columns names by range
:
df['d'] = df['created_at'].sub(df['join_date'])
print (df)
name event join_date created_at d
0 A X 2020-12-01 2020-12-01 0 days
1 A X 2020-12-01 2020-12-01 0 days
2 A X 2020-12-01 2020-12-02 1 days
3 A Y 2020-12-01 2020-12-02 1 days
4 B X 2020-12-05 2020-12-05 0 days
5 B X 2020-12-05 2020-12-07 2 days
6 C X 2020-12-07 2020-12-08 1 days
7 C X 2020-12-07 2020-12-09 2 days
df1 = (df.pivot_table(index=['name','event','join_date'],
columns='d',
aggfunc='size',
fill_value=0)
.reindex(pd.timedelta_range(df['d'].min(), df['d'].max()),
axis=1,
fill_value=0))
df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
df1 = df1.reset_index()
print (df1)
name event join_date day_0 day_1 day_2
0 A X 2020-12-01 2 1 0
1 A Y 2020-12-01 0 1 0
2 B X 2020-12-05 1 0 1
3 C X 2020-12-07 0 1 1