I have a table called clients, and I want to show how many times someone registered or purchased an item, based on the userid. The goal is to have a table that shows the sum of registration and purchase per userid.
This is the code I wrote. Unfortunately, not all of the columns show up:
new_file = new_data.groupby('userid')[['registration', 'purchase']].agg('sum')
new_file.head(5)
This is the table I'm using to figure out the registrations and purchases per userid:
Event_day   timestamp        install  userid  registration  purchase
1/1/1900    1/1/1900 16:10   yes      555221  1             0
1/1/1900    1/1/1900 16:12   yes      555221  1             1
2/19/2010   1/19/2010 16:40  no       533211  0             1
2/19/2010   1/19/2016 16:53  yes      533211  0             1
2/20/2017   2/20/2017 15:46  yes      53200   1             0
3/15/2017   3/15/2018 15:48  yes      53200   1             0
3/15/2017   3/15/2018 20:14  yes      53200   1             0
and I want something that gives me the sums per userid:
Event_day   timestamp        install  userid  registration  purchase
1/1/1900    1/1/1900 16:10   yes      555221  2             1
2/19/2010   1/19/2016 16:53  yes      533211  0             2
3/15/2017   3/15/2018 20:14  yes      53200   3             0
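To make this reproducible, the sample above can be rebuilt as a DataFrame like this (a hand-made sketch of the sample data, not the real clients table):

import pandas as pd

# Reconstruction of the sample table from the question, so the
# snippets below can be run as-is.
df = pd.DataFrame({
    'Event_day':    ['1/1/1900', '1/1/1900', '2/19/2010', '2/19/2010',
                     '2/20/2017', '3/15/2017', '3/15/2017'],
    'timestamp':    ['1/1/1900 16:10', '1/1/1900 16:12', '1/19/2010 16:40',
                     '1/19/2016 16:53', '2/20/2017 15:46', '3/15/2018 15:48',
                     '3/15/2018 20:14'],
    'install':      ['yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes'],
    'userid':       [555221, 555221, 533211, 533211, 53200, 53200, 53200],
    'registration': [1, 1, 0, 0, 1, 1, 1],
    'purchase':     [0, 1, 1, 1, 0, 0, 0],
})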
IIUC, you can keep the first or last value of the other columns by passing a dict to agg:
# 'last' keeps the last value seen per group; 'sum' totals the numeric columns
agg = {'Event_day': 'last', 'timestamp': 'last', 'install': 'last', 'registration': 'sum', 'purchase': 'sum'}
df.groupby('userid').agg(agg).reset_index()
   userid  Event_day        timestamp install  registration  purchase
0   53200  3/15/2017  3/15/2018 20:14     yes             3         0
1  533211  2/19/2010  1/19/2016 16:53     yes             0         2
2  555221   1/1/1900   1/1/1900 16:12     yes             2         1
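As a side note, on pandas >= 0.25 the same thing can be written with named aggregation instead of a dict; this is an equivalent sketch, not part of the original suggestion:

# Named-aggregation form: each keyword names an output column as
# (source_column, aggregation_function).
out = df.groupby('userid').agg(
    Event_day=('Event_day', 'last'),
    timestamp=('timestamp', 'last'),
    install=('install', 'last'),
    registration=('registration', 'sum'),
    purchase=('purchase', 'sum'),
).reset_index()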
Keeping in mind that several answers may be equally correct, I found it interesting to run a performance test on them.
Timings
# Approach 1: two groupbys (max of the other columns, sum of the numeric
# ones), concatenated side by side
dfg1 = df.groupby("userid")[["install", "timestamp", "Event_day"]].max()
dfg2 = df.groupby("userid")[["registration", "purchase"]].sum()
pd.concat([dfg1, dfg2], axis=1)
38.5 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Approach 2: keep the first row of the other columns per userid, sum the
# numeric columns, then merge the two results back together
first_other_columns = df[['Event_day', 'timestamp', 'install', 'userid']].drop_duplicates(['userid'], keep='first')
grouped = df.groupby(['userid']).sum().reset_index()
pd.merge(grouped, first_other_columns, on=['userid'])
11.3 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Approach 3: a single groupby with a dict of per-column aggregations
# (the suggestion above)
agg = {'Event_day': 'last', 'timestamp': 'last', 'install': 'last', 'registration': 'sum', 'purchase': 'sum'}
df.groupby('userid').agg(agg).reset_index()
6.85 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
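The numbers above look like IPython %timeit output. To reproduce this kind of comparison as a plain script, here is a sketch using the standard-library timeit module (the synthetic data, sizes, and seed are my assumptions, so the absolute numbers will differ):

import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: ~100k rows over a few thousand users, so the
# differences between approaches are measurable.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'Event_day':    '1/1/1900',
    'timestamp':    '1/1/1900 16:10',
    'install':      'yes',
    'userid':       rng.integers(0, 5_000, size=n),
    'registration': rng.integers(0, 2, size=n),
    'purchase':     rng.integers(0, 2, size=n),
})

agg = {'Event_day': 'last', 'timestamp': 'last', 'install': 'last',
       'registration': 'sum', 'purchase': 'sum'}

def agg_dict():
    # single groupby with per-column aggregations
    return df.groupby('userid').agg(agg).reset_index()

def concat_two_groupbys():
    # max of the other columns + sum of the numeric ones, concatenated
    dfg1 = df.groupby('userid')[['install', 'timestamp', 'Event_day']].max()
    dfg2 = df.groupby('userid')[['registration', 'purchase']].sum()
    return pd.concat([dfg1, dfg2], axis=1)

for fn in (agg_dict, concat_two_groupbys):
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')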