Search code examples
pythonpandasgroup-bydataframe

pandas group by with reset index


Given I have following pandas dataframe

UserID LoginDate PlacedOrderItems Other column
A123456 2023-04-01 1 a
A123456 2023-04-10 0 b
B123456 2023-04-05 0 c
B123456 2023-04-06 2 d
B123456 2023-04-07 0 e

I want to use group by UserID and get min of LoginDate and max of PlacedOrderItems added to each row of my datafarme. I tried Group by but turns the column names into a weird multi level indexing.

But What I want is something like below where I preserve the original dataframe and just get two new columns added.

UserID LoginDate PlacedOrderItems Other column min_login_date max_order_items
A123456 2023-04-01 1 a 2023-04-01 1
A123456 2023-04-10 0 b 2023-04-01 1
B123456 2023-04-05 0 c 2023-04-05 2
B123456 2023-04-06 2 d 2023-04-05 2
B123456 2023-04-07 1 e 2023-04-05 2

Solution

  • You can achieve this by using the groupby() method with the transform() method to create two new columns: min_login_date and max_order_items:

    import pandas as pd
    
    
    # set display options
    pd.set_option('display.max_columns', 10)
    pd.set_option('display.width', 100)
    
    # create the dataframe
    data = {'UserID': ['A123456', 'A123456', 'B123456', 'B123456', 'B123456'],
            'LoginDate': ['2023-04-01', '2023-04-10', '2023-04-05', '2023-04-06', '2023-04-07'],
            'PlacedOrderItems': [1, 0, 0, 2, 0],
            'OtherColumn': ['a', 'b', 'c', 'd', 'e']}
    df = pd.DataFrame(data)
    
    # convert LoginDate to datetime format
    df['LoginDate'] = pd.to_datetime(df['LoginDate'])
    
    # group by UserID and transform to get the min LoginDate and max PlacedOrderItems
    df['min_login_date'] = df.groupby('UserID')['LoginDate'].transform('min')
    df['max_order_items'] = df.groupby('UserID')['PlacedOrderItems'].transform('max')
    
    # group by all columns and take the first row of each group
    group_cols = ['UserID', 'LoginDate', 'PlacedOrderItems', 'OtherColumn', 'min_login_date', 'max_order_items']
    df = df.groupby(group_cols).first().reset_index()
    
    # print the resulting dataframe
    print(df.to_string(index=False))