Search code examples
pythonpandasdataframepandas-groupby

Optimize min-max and groupby query for Pandas Dataframe


I have an input dataframe that looks like this:

  col0  col1 name
0   g0   100   n1
1   g0   120   n1
2   g1    40   n3
3   g1    30   n4
4   g2    10   n5
5   g2   100   n5
6   g2   250   n5

I want to obtain a new dataframe (or the same, it doesn't matter) where col1 and col2 contains respectively the minimum and maximum value for each tuple (col0, name). I am able to perform this with the following snippet:

import pandas as pd
data = {'col0': ["g0", "g0", "g1", "g1", "g2", "g2", "g2"], 'col1': [100, 120, 40, 30, 10, 100, 250], 'name': ["n1", "n1", "n3", "n4", "n5", "n5", "n5"]}
df = pd.DataFrame(data=data)

df_res = pd.DataFrame()

names = df["name"].unique()

for name in names:
    literals = df[df["name"] == name]["col0"].unique()

    for literal in literals:
        min_val = df[(df["name"] == name) &
                     (df["col0"] == literal)]["col1"].min()
        max_val = df[(df["name"] == name) &
                     (df["col0"] == literal)]["col1"].min()

        elem = {"col0": literal, "col1": min_val, "col2": max_val,
                "name": name}

        df_res = df_res.append(elem, ignore_index=True)

Which gives the desired output stored on df_res, which is as follows:

  col0   col1   col2 name
0   g0  100.0  100.0   n1
1   g1   40.0   40.0   n3
2   g1   30.0   30.0   n4
3   g2   10.0   10.0   n5

The thing is that I have to make this transformation with a dataframe with 95K registers and my approach is extremely slow (it takes several minutes no finish).

Therefore, my question is: is there any way to optimize this operation?


Solution

  • You can use groupby.agg and pass min and max to get the desired columns. To set custom names to aggregate columns, you can use named aggregation where you name the min column col1 and max column col2:

    out = df.groupby(['col0','name'], as_index=False).agg(col1=('col1','min'), col2=('col1','max'))
    

    The same code in another way:

    out = df.groupby(['col0','name'])['col1'].agg(['min','max']).set_axis(['col1','col2'], axis=1).reset_index()
    

    Output:

      col0 name  col1  col2
    0   g0   n1   100   120
    1   g1   n3    40    40
    2   g1   n4    30    30
    3   g2   n5    10   250