Search code examples
pythonpython-polarsexploded

How to extract values based on column names and put it in another column in polars?


I would like to fill a value in a column based on another columns' name, in the Polars library from python (I obtained the following DF by exploding my variables' column names):

Input:

df = pl.from_repr("""
┌────────┬─────────┬────────┬─────┬──────────┐
│ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable │
│ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      │
│ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      │
╞════════╪═════════╪════════╪═════╪══════════╡
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Average  │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Median   │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Q1       │
└────────┴─────────┴────────┴─────┴──────────┘
""")

Expected output:

shape: (6, 6)
┌────────┬─────────┬────────┬─────┬──────────┬───────┐
│ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable ┆ value │
│ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      ┆ ---   │
│ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      ┆ i64   │
╞════════╪═════════╪════════╪═════╪══════════╪═══════╡
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  ┆ 2     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   ┆ 3     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       ┆ 4     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Average  ┆ 1     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Median   ┆ 5     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Q1       ┆ 10    │
└────────┴─────────┴────────┴─────┴──────────┴───────┘

I have tried:

df = df.with_columns(value = pl.col(f"{pl.col.variable}"))

But that does not work because polars perceives the argument as a function (?). Does anyone know how to do this?

Note: I have also tried to transpose the dataframe, which, not only was that computationally expensive, also did not work! Because it would transpose the DF into a 5-rows-long DF. What I need is a (Name * Number of Variables)-rows-long DF.

That is, for example, I have 3 different names (say, Apple, Banana, and Dragonfruit), and I have 3 variables (Average, Median, Q1), then my DF should be 9-rows-long!


Solution

  • You can use

    • when/then() to check whether the value of the column Variable is the same as the column name.
    • coalesce() to choose first non-empty result.
    df.with_columns(
        value = pl.coalesce(
            pl.when(pl.col.Variable == col).then(pl.col(col))
            for col in df["Variable"].unique()
        )
    )
    
    ┌────────┬─────────┬────────┬─────┬──────────┬───────┐
    │ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable ┆ value │
    │ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      ┆ ---   │
    │ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      ┆ i64   │
    ╞════════╪═════════╪════════╪═════╪══════════╪═══════╡
    │ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  ┆ 2     │
    │ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   ┆ 3     │
    │ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       ┆ 4     │
    │ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Average  ┆ 3     │
    │ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Median   ┆ 5     │
    │ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Q1       ┆ 10    │
    └────────┴─────────┴────────┴─────┴──────────┴───────┘