I have a dataframe with multiple IDs and corresponding states. I want to analyze how the states have changed over time and present this information effectively.
Here is an example:
import polars as pl
df = pl.DataFrame({
"ID": [1, 2, 3],
"T0": ["A", "B", "C"],
"T1": ["B", "B", "A"],
})
One aproach it's to "concat" the columns , and then do a value_counts()
of the Change column
df = df.with_columns(
(pl.col("T0") + " -> " + pl.col("T1")).alias("Change")
)
However, there might be a better approach to this, or even a built-in function that can achieve what I need more efficiently.
Current Output:
shape: (3, 4)
┌─────┬─────┬─────┬────────┐
│ ID ┆ T0 ┆ T1 ┆ Change │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═════╪═════╪═════╪════════╡
│ 1 ┆ A ┆ B ┆ A -> B │
│ 2 ┆ B ┆ B ┆ B -> B │
│ 3 ┆ C ┆ A ┆ C -> A │
└─────┴─────┴─────┴────────┘
shape: (3, 2)
┌────────┬───────┐
│ Change ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪═══════╡
│ C -> A ┆ 1 │
│ B -> B ┆ 1 │
│ A -> B ┆ 1 │
└────────┴───────┘
Depending on the exact structure of the output you need, you can use pl.DataFrame.group_by()
and .len()
:
df.group_by("T0","T1").len("count")
shape: (3, 3)
┌─────┬─────┬───────┐
│ T0 ┆ T1 ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═════╪═════╪═══════╡
│ C ┆ A ┆ 1 │
│ A ┆ B ┆ 1 │
│ B ┆ B ┆ 1 │
└─────┴─────┴───────┘
You can also concat columns afterwards if you want:
(
df
.group_by("T0","T1")
.len("count")
.select(
Change = pl.col.T0 + ' -> ' + pl.col.T1,
count = pl.col.count
)
)
shape: (3, 2)
┌────────┬───────┐
│ Change ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪═══════╡
│ A -> B ┆ 1 │
│ B -> B ┆ 1 │
│ C -> A ┆ 1 │
└────────┴───────┘
If you want to use pl.Expr.value_counts()
, you can first combine columns into pl.struct()
:
df.select(pl.struct("T0","T1").value_counts())
shape: (3, 1)
┌───────────────┐
│ T0 │
│ --- │
│ struct[2] │
╞═══════════════╡
│ {{"A","B"},1} │
│ {{"C","A"},1} │
│ {{"B","B"},1} │
└───────────────┘