Given the following pandas dataframe:
UserID | LoginDate | PlacedOrderItems | Other column |
---|---|---|---|
A123456 | 2023-04-01 | 1 | a |
A123456 | 2023-04-10 | 0 | b |
B123456 | 2023-04-05 | 0 | c |
B123456 | 2023-04-06 | 2 | d |
B123456 | 2023-04-07 | 0 | e |
I want to group by UserID, take the min of LoginDate and the max of PlacedOrderItems, and add both values to each row of my dataframe. I tried groupby, but it turns the column names into a weird multi-level index.
What I want is something like the table below, where the original dataframe is preserved and just two new columns are added:
UserID | LoginDate | PlacedOrderItems | Other column | min_login_date | max_order_items |
---|---|---|---|---|---|
A123456 | 2023-04-01 | 1 | a | 2023-04-01 | 1 |
A123456 | 2023-04-10 | 0 | b | 2023-04-01 | 1 |
B123456 | 2023-04-05 | 0 | c | 2023-04-05 | 2 |
B123456 | 2023-04-06 | 2 | d | 2023-04-05 | 2 |
B123456 | 2023-04-07 | 0 | e | 2023-04-05 | 2 |
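You can achieve this with groupby() followed by transform(). Unlike an aggregation, transform() returns a result aligned to the original index, one value per original row, so it can be assigned directly as new columns, here min_login_date and max_order_items: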
import pandas as pd
# set display options
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)
# create the dataframe
data = {'UserID': ['A123456', 'A123456', 'B123456', 'B123456', 'B123456'],
        'LoginDate': ['2023-04-01', '2023-04-10', '2023-04-05', '2023-04-06', '2023-04-07'],
        'PlacedOrderItems': [1, 0, 0, 2, 0],
        'OtherColumn': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
# convert LoginDate to datetime format
df['LoginDate'] = pd.to_datetime(df['LoginDate'])
# group by UserID and transform to get the min LoginDate and max PlacedOrderItems
df['min_login_date'] = df.groupby('UserID')['LoginDate'].transform('min')
df['max_order_items'] = df.groupby('UserID')['PlacedOrderItems'].transform('max')
# no further grouping is needed: transform() keeps every original row and its order,
# whereas grouping by all columns again would re-sort the rows and drop exact duplicates
# print the resulting dataframe
print(df.to_string(index=False))
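
If you would rather compute the per-user values once and join them back, the following is a minimal sketch of an alternative, assuming the same sample data as above. It uses named aggregation, which gives flat column names instead of the multi-level index mentioned in the question, and then a left merge on UserID. The names per_user and result are only illustrative.

```python
import pandas as pd

# Same sample data as in the question (OtherColumn is the assumed column name)
df = pd.DataFrame({
    'UserID': ['A123456', 'A123456', 'B123456', 'B123456', 'B123456'],
    'LoginDate': pd.to_datetime(['2023-04-01', '2023-04-10', '2023-04-05',
                                 '2023-04-06', '2023-04-07']),
    'PlacedOrderItems': [1, 0, 0, 2, 0],
    'OtherColumn': ['a', 'b', 'c', 'd', 'e'],
})

# Named aggregation: one row per UserID, with flat (single-level) column names
per_user = df.groupby('UserID', as_index=False).agg(
    min_login_date=('LoginDate', 'min'),
    max_order_items=('PlacedOrderItems', 'max'),
)

# Left merge copies the per-user values onto every original row
result = df.merge(per_user, on='UserID', how='left')
print(result.to_string(index=False))
```

For just two new columns, transform() is usually the shorter option. The merge approach may be more convenient when there are many aggregates, or when the per-user table is also needed on its own.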