I have an input dataframe that looks like this:
col0 col1 name
0 g0 100 n1
1 g0 120 n1
2 g1 40 n3
3 g1 30 n4
4 g2 10 n5
5 g2 100 n5
6 g2 250 n5
I want to obtain a new dataframe (or the same, it doesn't matter) where col1
and col2
contains respectively the minimum and maximum value for each tuple (col0, name)
. I am able to perform this with the following snippet:
import pandas as pd
data = {'col0': ["g0", "g0", "g1", "g1", "g2", "g2", "g2"], 'col1': [100, 120, 40, 30, 10, 100, 250], 'name': ["n1", "n1", "n3", "n4", "n5", "n5", "n5"]}
df = pd.DataFrame(data=data)
df_res = pd.DataFrame()
names = df["name"].unique()
for name in names:
literals = df[df["name"] == name]["col0"].unique()
for literal in literals:
min_val = df[(df["name"] == name) &
(df["col0"] == literal)]["col1"].min()
max_val = df[(df["name"] == name) &
(df["col0"] == literal)]["col1"].min()
elem = {"col0": literal, "col1": min_val, "col2": max_val,
"name": name}
df_res = df_res.append(elem, ignore_index=True)
Which gives the desired output stored on df_res
, which is as follows:
col0 col1 col2 name
0 g0 100.0 100.0 n1
1 g1 40.0 40.0 n3
2 g1 30.0 30.0 n4
3 g2 10.0 10.0 n5
The thing is that I have to make this transformation with a dataframe with 95K registers and my approach is extremely slow (it takes several minutes no finish).
Therefore, my question is: is there any way to optimize this operation?
You can use groupby.agg
and pass min
and max
to get the desired columns. To set custom names to aggregate columns, you can use named aggregation where you name the min column col1
and max column col2
:
out = df.groupby(['col0','name'], as_index=False).agg(col1=('col1','min'), col2=('col1','max'))
The same code in another way:
out = df.groupby(['col0','name'])['col1'].agg(['min','max']).set_axis(['col1','col2'], axis=1).reset_index()
Output:
col0 name col1 col2
0 g0 n1 100 120
1 g1 n3 40 40
2 g1 n4 30 30
3 g2 n5 10 250