I'm trying to use polars
dataframe where I would like to select the first
and last
row per group. Here is a simple example selecting the first row per group:
import polars as pl
df = pl.DataFrame(
{
"a": [1, 2, 2, 3, 4, 5],
"b": [0.5, 0.5, 4, 10, 14, 13],
"c": [True, True, True, False, False, True],
"d": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana"],
}
)
result = df.group_by("d", maintain_order=True).first()
print(result)
Output:
shape: (2, 4)
┌────────┬─────┬──────┬───────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪═══════╡
│ Apple ┆ 1 ┆ 0.5 ┆ true │
│ Banana ┆ 3 ┆ 10.0 ┆ false │
└────────┴─────┴──────┴───────┘
This works good and we can use .last
to do it for the last row. But how can we combine these in one group_by
?
You could use agg
, you will have to add a suffix
(or prefix
) to differentiate the columns names:
result = (df.group_by('d', maintain_order=True)
.agg(pl.all().first().name.suffix('_first'),
pl.all().last().name.suffix('_last'))
)
Output:
┌────────┬─────────┬─────────┬─────────┬────────┬────────┬────────┐
│ d ┆ a_first ┆ b_first ┆ c_first ┆ a_last ┆ b_last ┆ c_last │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ bool ┆ i64 ┆ f64 ┆ bool │
╞════════╪═════════╪═════════╪═════════╪════════╪════════╪════════╡
│ Apple ┆ 1 ┆ 0.5 ┆ true ┆ 2 ┆ 4.0 ┆ true │
│ Banana ┆ 3 ┆ 10.0 ┆ false ┆ 5 ┆ 13.0 ┆ true │
└────────┴─────────┴─────────┴─────────┴────────┴────────┴────────┘
If you want multiple rows, then you would need to concat
:
g = df.group_by('d', maintain_order=True)
result = pl.concat([g.first(), g.last()]).sort(by='d', maintain_order=True)
Output:
┌────────┬─────┬──────┬───────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪═══════╡
│ Apple ┆ 1 ┆ 0.5 ┆ true │
│ Apple ┆ 2 ┆ 4.0 ┆ true │
│ Banana ┆ 3 ┆ 10.0 ┆ false │
│ Banana ┆ 5 ┆ 13.0 ┆ true │
└────────┴─────┴──────┴───────┘
Or using filter
with int_range
+over
:
result = df.filter((pl.int_range(pl.len()).over('d') == 0)
|(pl.int_range(pl.len(), 0, -1).over('d') == 1)
)
Output:
┌─────┬──────┬───────┬────────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str │
╞═════╪══════╪═══════╪════════╡
│ 1 ┆ 0.5 ┆ true ┆ Apple │
│ 2 ┆ 4.0 ┆ true ┆ Apple │
│ 3 ┆ 10.0 ┆ false ┆ Banana │
│ 5 ┆ 13.0 ┆ true ┆ Banana │
└─────┴──────┴───────┴────────┘