Search code examples
pythonpython-polars

Create list column out of column names


I have a simple pl.DataFrame with a number of columns that only contain boolean values.

import polars as pl

df = pl.DataFrame(
    {"s1": [True, True, False], "s2": [False, True, True], "s3": [False, False, False]}
)

shape: (3, 3)
┌───────┬───────┬───────┐
│ s1    ┆ s2    ┆ s3    │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ false ┆ false │
│ true  ┆ true  ┆ false │
│ false ┆ true  ┆ false │
└───────┴───────┴───────┘

I need to add another column that contains lists of varying length. A list in any individual row should contain the column name where the values of the columns S1, s2, and s3 have a True value.

Here's what I am actually looking for:

shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1    ┆ s2    ┆ s3    │ list         │
│ ---   ┆ ---   ┆ ---   │ ---          │
│ bool  ┆ bool  ┆ bool  │ list[str]    │
╞═══════╪═══════╪═══════╡══════════════╡
│ true  ┆ false ┆ false │ ["s1"]       │
│ true  ┆ true  ┆ false │ ["s1", "s2"] │
│ false ┆ true  ┆ false │ ["s2"]       │
└───────┴───────┴───────┴──────────────┘

Solution

  • List API

    You could build a list of when/then expressions and then remove the nulls.

    df.with_columns(
       pl.concat_list(
          pl.when(col).then(pl.lit(col)) for col in df.columns
       )
       .list.drop_nulls()
       .alias("list")
    )
    
    shape: (3, 4)
    ┌───────┬───────┬───────┬──────────────┐
    │ s1    ┆ s2    ┆ s3    ┆ list         │
    │ ---   ┆ ---   ┆ ---   ┆ ---          │
    │ bool  ┆ bool  ┆ bool  ┆ list[str]    │
    ╞═══════╪═══════╪═══════╪══════════════╡
    │ true  ┆ false ┆ false ┆ ["s1"]       │
    │ true  ┆ true  ┆ false ┆ ["s1", "s2"] │
    │ false ┆ true  ┆ false ┆ ["s2"]       │
    └───────┴───────┴───────┴──────────────┘
    

    Unpivot

    If "raw performance" is of concern, it can be done at the frame level.

    You can reshape with .unpivot() and .group_by to create the lists.

    (df.with_row_index()
       .unpivot(index="index")
       .filter(pl.col.value) 
       .group_by("index", maintain_order=True)
       .agg(pl.col.variable.alias("list"))
    )
    
    shape: (3, 2)
    ┌───────┬──────────────┐
    │ index ┆ list         │
    │ ---   ┆ ---          │
    │ u32   ┆ list[str]    │
    ╞═══════╪══════════════╡
    │ 0     ┆ ["s1"]       │
    │ 1     ┆ ["s1", "s2"] │
    │ 2     ┆ ["s2"]       │
    └───────┴──────────────┘
    

    As we've maintained the order, we can horizontally .concat() to combine them.

    pl.concat(
        [
            df,
            df.with_row_index()
              .unpivot(index="index")
              .filter(pl.col.value)
              .group_by("index", maintain_order=True)
              .agg(pl.col.variable.alias("list"))
              .drop("index") # optional
        ], 
        how = "horizontal"
    )
    
    shape: (3, 4)
    ┌───────┬───────┬───────┬──────────────┐
    │ s1    ┆ s2    ┆ s3    ┆ list         │
    │ ---   ┆ ---   ┆ ---   ┆ ---          │
    │ bool  ┆ bool  ┆ bool  ┆ list[str]    │
    ╞═══════╪═══════╪═══════╪══════════════╡
    │ true  ┆ false ┆ false ┆ ["s1"]       │
    │ true  ┆ true  ┆ false ┆ ["s1", "s2"] │
    │ false ┆ true  ┆ false ┆ ["s2"]       │
    └───────┴───────┴───────┴──────────────┘
    

    Timing

    As a basic comparison.

    bigger_df = df.sample(2_000_000, with_replacement=True)
    
    Name Time
    concat_list 1.4s
    unpivot + concat 0.2s