Create list column out of column names

I have a simple pl.DataFrame with a number of columns that only contain boolean values.

import polars as pl

df = pl.DataFrame(
    {"s1": [True, True, False], "s2": [False, True, True], "s3": [False, False, False]}
)

shape: (3, 3)
┌───────┬───────┬───────┐
│ s1    ┆ s2    ┆ s3    │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ false ┆ false │
│ true  ┆ true  ┆ false │
│ false ┆ true  ┆ false │
└───────┴───────┴───────┘

I need to add another column that contains lists of varying length. A list in any individual row should contain the column name where the values of the columns S1, s2, and s3 have a True value.

Here's what I am actually looking for:

shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1    ┆ s2    ┆ s3    │ list         │
│ ---   ┆ ---   ┆ ---   │ ---          │
│ bool  ┆ bool  ┆ bool  │ list[str]    │
╞═══════╪═══════╪═══════╡══════════════╡
│ true  ┆ false ┆ false │ ["s1"]       │
│ true  ┆ true  ┆ false │ ["s1", "s2"] │
│ false ┆ true  ┆ false │ ["s2"]       │
└───────┴───────┴───────┴──────────────┘

Solution

List API

You could build a list of when/then expressions and then remove the nulls.

df.with_columns(
   pl.concat_list(
      pl.when(col).then(pl.lit(col)) for col in df.columns
   )
   .list.drop_nulls()
   .alias("list")
)

shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1    ┆ s2    ┆ s3    ┆ list         │
│ ---   ┆ ---   ┆ ---   ┆ ---          │
│ bool  ┆ bool  ┆ bool  ┆ list[str]    │
╞═══════╪═══════╪═══════╪══════════════╡
│ true  ┆ false ┆ false ┆ ["s1"]       │
│ true  ┆ true  ┆ false ┆ ["s1", "s2"] │
│ false ┆ true  ┆ false ┆ ["s2"]       │
└───────┴───────┴───────┴──────────────┘

Unpivot

If "raw performance" is of concern, it can be done at the frame level.

You can reshape with .unpivot() and .group_by to create the lists.

(df.with_row_index()
   .unpivot(index="index")
   .filter(pl.col.value) 
   .group_by("index", maintain_order=True)
   .agg(pl.col.variable.alias("list"))
)

shape: (3, 2)
┌───────┬──────────────┐
│ index ┆ list         │
│ ---   ┆ ---          │
│ u32   ┆ list[str]    │
╞═══════╪══════════════╡
│ 0     ┆ ["s1"]       │
│ 1     ┆ ["s1", "s2"] │
│ 2     ┆ ["s2"]       │
└───────┴──────────────┘

As we've maintained the order, we can horizontally .concat() to combine them.

pl.concat(
    [
        df,
        df.with_row_index()
          .unpivot(index="index")
          .filter(pl.col.value)
          .group_by("index", maintain_order=True)
          .agg(pl.col.variable.alias("list"))
          .drop("index") # optional
    ], 
    how = "horizontal"
)

shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1    ┆ s2    ┆ s3    ┆ list         │
│ ---   ┆ ---   ┆ ---   ┆ ---          │
│ bool  ┆ bool  ┆ bool  ┆ list[str]    │
╞═══════╪═══════╪═══════╪══════════════╡
│ true  ┆ false ┆ false ┆ ["s1"]       │
│ true  ┆ true  ┆ false ┆ ["s1", "s2"] │
│ false ┆ true  ┆ false ┆ ["s2"]       │
└───────┴───────┴───────┴──────────────┘

Timing

As a basic comparison.

bigger_df = df.sample(2_000_000, with_replacement=True)

Name	Time
concat_list	1.4s
unpivot + concat	0.2s