I have a simple pl.DataFrame
with a number of columns that only contain boolean values.
import polars as pl
df = pl.DataFrame(
{"s1": [True, True, False], "s2": [False, True, True], "s3": [False, False, False]}
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ s1 ┆ s2 ┆ s3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ false ┆ false │
│ true ┆ true ┆ false │
│ false ┆ true ┆ false │
└───────┴───────┴───────┘
I need to add another column that contains lists of varying length. A list in any individual row should contain the column name where the values of the columns S1
, s2
, and s3
have a True
value.
Here's what I am actually looking for:
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 │ list │
│ --- ┆ --- ┆ --- │ --- │
│ bool ┆ bool ┆ bool │ list[str] │
╞═══════╪═══════╪═══════╡══════════════╡
│ true ┆ false ┆ false │ ["s1"] │
│ true ┆ true ┆ false │ ["s1", "s2"] │
│ false ┆ true ┆ false │ ["s2"] │
└───────┴───────┴───────┴──────────────┘
You could build a list of when/then expressions and then remove the nulls.
df.with_columns(
pl.concat_list(
pl.when(col).then(pl.lit(col)) for col in df.columns
)
.list.drop_nulls()
.alias("list")
)
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 ┆ list │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪══════════════╡
│ true ┆ false ┆ false ┆ ["s1"] │
│ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴──────────────┘
If "raw performance" is of concern, it can be done at the frame level.
You can reshape with .unpivot()
and .group_by
to create the lists.
(df.with_row_index()
.unpivot(index="index")
.filter(pl.col.value)
.group_by("index", maintain_order=True)
.agg(pl.col.variable.alias("list"))
)
shape: (3, 2)
┌───────┬──────────────┐
│ index ┆ list │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞═══════╪══════════════╡
│ 0 ┆ ["s1"] │
│ 1 ┆ ["s1", "s2"] │
│ 2 ┆ ["s2"] │
└───────┴──────────────┘
As we've maintained the order, we can horizontally .concat()
to combine them.
pl.concat(
[
df,
df.with_row_index()
.unpivot(index="index")
.filter(pl.col.value)
.group_by("index", maintain_order=True)
.agg(pl.col.variable.alias("list"))
.drop("index") # optional
],
how = "horizontal"
)
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 ┆ list │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪══════════════╡
│ true ┆ false ┆ false ┆ ["s1"] │
│ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴──────────────┘
As a basic comparison.
bigger_df = df.sample(2_000_000, with_replacement=True)
Name | Time |
---|---|
concat_list | 1.4s |
unpivot + concat | 0.2s |