Search code examples
python-polars

Create a new column based partially on other column names


I am new to both Polars and Python in general. I have a somewhat unusual problem I could use some help with. I have a dataframe with 50 plus columns that are 0/1. I need to create a new column that contains comma separated list of each column that contains a 1 but using part of the column name. If hccx = 1 then append x to a string column. A simplified example:

df=pl.DataFrame(
    {'id':[1,2,3], 'hcc1':[0,1,1],'hcc2':[0,0,1],'hcc5':[0,1,1],'hcc8':[1,0,0]}
)
shape: (3, 5)
┌─────┬──────┬──────┬──────┬──────┐
│ id  ┆ hcc1 ┆ hcc2 ┆ hcc5 ┆ hcc8 │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╪══════╡
│ 1   ┆ 0    ┆ 0    ┆ 0    ┆ 1    │
│ 2   ┆ 1    ┆ 0    ┆ 1    ┆ 0    │
│ 3   ┆ 1    ┆ 1    ┆ 1    ┆ 0    │
└─────┴──────┴──────┴──────┴──────┘

I want to create a new column (string type), hccall, that looks like the following:

id hccall
1 8
2 1,5
3 1,2,5

I imagine some type of list comprehension looping over columns that start with 'hcc' would work but I'm kind of stuck. I can create a loop but not sure how to append to the column from within the loop. Any slick ideas?


Solution

  • I think the easiest option is to first unpivot your dataframe so that you a row per id per column in your dataframe. You can than filter the rows that are equal to 1. You can than aggregate the last letter of the original rows into a list. With list.join you can combine the list into a string separated by commas

    (
        df.unpivot(index="id")
        .filter(pl.col("value") == 1)
        .group_by("id")
        .agg(pl.col("variable").str.slice(3))
        .with_columns(pl.col("variable").list.join(","))
    )
    
    shape: (3, 2)
    ┌─────┬──────────┐
    │ id  ┆ variable │
    │ --- ┆ ---      │
    │ i64 ┆ str      │
    ╞═════╪══════════╡
    │ 2   ┆ 1,5      │
    │ 3   ┆ 1,2,5    │
    │ 1   ┆ 8        │
    └─────┴──────────┘