Search code examples
python-polars

How to create a weighted sum of some columns in a Polars DataFrame?


I'm looking for an idiomatic way to calculate the weighted sum of a subset of the columns in a Polars DataFrame and add it to the DataFrame as new column. So let's say I want to multiply columns p1-p3 in the DataFrame below by the following weights and then sum them to create a new column.

weights = [7.4, 3.2, -0.13]

df = pl.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "p1": [44.3, 2.3, 2.4, 6.2],
        "p2": [7.3, 8.4, 10.3, 8.443],
        "p3": [70.3, 80.4, 100.3, 80.443],
        "p4": [16.4, 18.2, 11.5, 18.34],
    }
)
df
shape: (4, 5)
┌─────┬──────┬───────┬────────┬───────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   │
╞═════╪══════╪═══════╪════════╪═══════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 │
└─────┴──────┴───────┴────────┴───────┘

I have come up with the following solution that calculates the correct answer, but I feel that there is likely a simpler, more idiomatic method that would enable me to select the columns of interest without needing to re-specify the df within the with_columns function. Any suggestions?

df.with_columns(
    df.select(
        pl.col(col) * pl.lit(weights[i])
        for i, col in enumerate(["p1", "p2", "p3"])
    )
    .fold(lambda c1, c2: c1 + c2)
    .alias("index"),
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘

Solution

  • sum_horizontal() could help here.

    We can do the following:

    col_names = ["p1", "p2", "p3"]
    weights = [7.4, 3.2, -0.13]
    df.with_columns(
        pl.sum_horizontal(
            [pl.col(col_nm) * wgt
             for col_nm, wgt in zip(col_names, weights)]
        ).alias("index")
    )
    
    shape: (4, 6)
    ┌─────┬──────┬───────┬────────┬───────┬──────────┐
    │ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
    │ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
    │ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
    ╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
    │ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
    │ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
    │ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
    │ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
    └─────┴──────┴───────┴────────┴───────┴──────────┘
    

    I used zip instead of enumerate .. but that's a stylistic choice. And I allowed Polars to broadcast the weight scalar to a literal, instead of explictly using pl.lit.

    Another trick that may help with readability: we can generate the list of expressions outside the with_columns/select contexts.

    For example:

    col_names = ["p1", "p2", "p3", "p4"]
    weights = [7.4, 3.2, -0.13, 0.0]
    wghtd_cols = [
        pl.col(col_nm) * wgt
        for col_nm, wgt in zip(col_names, weights)
        if wgt != 0.0
    ]
    
    df.with_columns(pl.sum_horizontal(wghtd_cols).alias("index"))
    
    shape: (4, 6)
    ┌─────┬──────┬───────┬────────┬───────┬──────────┐
    │ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
    │ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
    │ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
    ╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
    │ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
    │ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
    │ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
    │ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
    └─────┴──────┴───────┴────────┴───────┴──────────┘
    

    This is particularly useful when one part of your code is generating the weights and/or selecting columns, and another part of your code is creating the resulting weighted sum column in the DataFrame.