I need to compute the percentage of positive values in the value column, grouped by the group column.
import polars as pl

df = pl.DataFrame(
    {
        "group": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
        "value": [2, -1, 3, 1, -2, 1, 2, -1, 3, 2],
    }
)
shape: (10, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ A ┆ 2 │
│ A ┆ -1 │
│ A ┆ 3 │
│ A ┆ 1 │
│ A ┆ -2 │
│ B ┆ 1 │
│ B ┆ 2 │
│ B ┆ -1 │
│ B ┆ 3 │
│ B ┆ 2 │
└───────┴───────┘
In group A there are 3 out of 5 positive values (60%), while in group B there are 4 out of 5 positive values (80%).
Here's the expected dataframe.
┌────────┬──────────────────┐
│ group ┆ positive_percent │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════════════╡
│ A ┆ 0.6 │
│ B ┆ 0.8 │
└────────┴──────────────────┘
You could use a custom group_by.agg with Expr.ge and Expr.mean. This converts each value to True if it is greater than or equal to 0 and to False otherwise, then computes the proportion of True values in each group by taking the mean:
df.group_by('group').agg(positive_percent=pl.col('value').ge(0).mean())
Output:
┌───────┬──────────────────┐
│ group ┆ positive_percent │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════╪══════════════════╡
│ A ┆ 0.6 │
│ B ┆ 0.8 │
└───────┴──────────────────┘
Intermediates:
┌───────┬───────┬───────┬──────┐
│ group ┆ value ┆ ge(0) ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool ┆ f64 │
╞═══════╪═══════╪═══════╪══════╡
│ A ┆ 2 ┆ true ┆ 0.6 │ #
│ A ┆ -1 ┆ false ┆ 0.6 │ # group A
│ A ┆ 3 ┆ true ┆ 0.6 │ # (True+False+True+True+False)/5
│ A ┆ 1 ┆ true ┆ 0.6 │ # = 3/5 = 0.6
│ A ┆ -2 ┆ false ┆ 0.6 │ #
│ B ┆ 1 ┆ true ┆ 0.8 │
│ B ┆ 2 ┆ true ┆ 0.8 │
│ B ┆ -1 ┆ false ┆ 0.8 │
│ B ┆ 3 ┆ true ┆ 0.8 │
│ B ┆ 2 ┆ true ┆ 0.8 │
└───────┴───────┴───────┴──────┘