I am trying to do a polars group_by while keeping columns other than the ones passed to group_by.
Here is an example of an input data frame that I have.
df = pl.from_repr("""
┌─────┬─────┬─────┬─────┐
│ SRC ┆ TGT ┆ IT  ┆ Cd  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2   ┆ 3.0 │
│ 2   ┆ 1   ┆ 2   ┆ 4.0 │
│ 3   ┆ 1   ┆ 2   ┆ 3.0 │
│ 3   ┆ 2   ┆ 1   ┆ 8.0 │
└─────┴─────┴─────┴─────┘
""")
I want to group by ['TGT', 'IT'] using min('Cd'), which is the following code:
df.group_by('TGT', 'IT').agg(pl.col('Cd').min())
With this line of code, I obtain the following dataframe.
┌─────┬─────┬─────┐
│ TGT ┆ IT  ┆ Cd  │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3.0 │
│ 2   ┆ 1   ┆ 8.0 │
└─────┴─────┴─────┘
And here is the dataframe I would like to get instead:
┌─────┬─────┬─────┬─────┐
│ SRC ┆ TGT ┆ IT  ┆ Cd  │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2   ┆ 3.0 │
│ 3   ┆ 2   ┆ 1   ┆ 8.0 │
└─────┴─────┴─────┴─────┘
I think I could achieve this by joining the first dataframe on the grouped one using ['TGT', 'IT', 'Cd'], and then deleting the duplicated rows, as I only want one (any) 'SRC' for each ('TGT', 'IT') pair. But I wanted to know if there is a more straightforward way to do it, especially by keeping the 'SRC' column during the group_by.
Thanks in advance
import polars as pl

# Your data
data = {
    "SRC": [1, 2, 3, 3],
    "TGT": [1, 1, 1, 2],
    "IT": [2, 2, 2, 1],
    "Cd": [3.0, 4.0, 3.0, 8.0],
}
df = pl.DataFrame(data)

# Perform the group_by and aggregation.
# Note: first() takes the first SRC of each group in row order, which is
# not necessarily the SRC of the row holding the minimum Cd.
result = (
    df.group_by('TGT', 'IT', maintain_order=True)
    .agg(
        pl.col('SRC').first(),
        pl.col('Cd').min(),
    )
    .select('SRC', 'TGT', 'IT', 'Cd')  # to reorder columns
)
print(result)