I'm looking for an idiomatic way to calculate the weighted sum of a subset of the columns in a Polars DataFrame and add it to the DataFrame as new column. So let's say I want to multiply columns p1-p3 in the DataFrame below by the following weights and then sum them to create a new column.
weights = [7.4, 3.2, -0.13]
df = pl.DataFrame(
{
"id": [1, 2, 3, 4],
"p1": [44.3, 2.3, 2.4, 6.2],
"p2": [7.3, 8.4, 10.3, 8.443],
"p3": [70.3, 80.4, 100.3, 80.443],
"p4": [16.4, 18.2, 11.5, 18.34],
}
)
df
shape: (4, 5)
┌─────┬──────┬───────┬────────┬───────┐
│ id ┆ p1 ┆ p2 ┆ p3 ┆ p4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪═══════╪════════╪═══════╡
│ 1 ┆ 44.3 ┆ 7.3 ┆ 70.3 ┆ 16.4 │
│ 2 ┆ 2.3 ┆ 8.4 ┆ 80.4 ┆ 18.2 │
│ 3 ┆ 2.4 ┆ 10.3 ┆ 100.3 ┆ 11.5 │
│ 4 ┆ 6.2 ┆ 8.443 ┆ 80.443 ┆ 18.34 │
└─────┴──────┴───────┴────────┴───────┘
I have come up with the following solution that calculates the correct answer, but I feel that there is likely a simpler, more idiomatic method that would enable me to select the columns of interest without needing to re-specify the df
within the with_columns
function. Any suggestions?
df.with_columns(
df.select(
pl.col(col) * pl.lit(weights[i])
for i, col in enumerate(["p1", "p2", "p3"])
)
.fold(lambda c1, c2: c1 + c2)
.alias("index"),
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id ┆ p1 ┆ p2 ┆ p3 ┆ p4 ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1 ┆ 44.3 ┆ 7.3 ┆ 70.3 ┆ 16.4 ┆ 342.041 │
│ 2 ┆ 2.3 ┆ 8.4 ┆ 80.4 ┆ 18.2 ┆ 33.448 │
│ 3 ┆ 2.4 ┆ 10.3 ┆ 100.3 ┆ 11.5 ┆ 37.681 │
│ 4 ┆ 6.2 ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘
sum_horizontal()
could help here.
We can do the following:
col_names = ["p1", "p2", "p3"]
weights = [7.4, 3.2, -0.13]
df.with_columns(
pl.sum_horizontal(
[pl.col(col_nm) * wgt
for col_nm, wgt in zip(col_names, weights)]
).alias("index")
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id ┆ p1 ┆ p2 ┆ p3 ┆ p4 ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1 ┆ 44.3 ┆ 7.3 ┆ 70.3 ┆ 16.4 ┆ 342.041 │
│ 2 ┆ 2.3 ┆ 8.4 ┆ 80.4 ┆ 18.2 ┆ 33.448 │
│ 3 ┆ 2.4 ┆ 10.3 ┆ 100.3 ┆ 11.5 ┆ 37.681 │
│ 4 ┆ 6.2 ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘
I used zip
instead of enumerate
.. but that's a stylistic choice. And I allowed Polars to broadcast the weight scalar to a literal, instead of explictly using pl.lit
.
Another trick that may help with readability: we can generate the list of expressions outside the with_columns/select contexts.
For example:
col_names = ["p1", "p2", "p3", "p4"]
weights = [7.4, 3.2, -0.13, 0.0]
wghtd_cols = [
pl.col(col_nm) * wgt
for col_nm, wgt in zip(col_names, weights)
if wgt != 0.0
]
df.with_columns(pl.sum_horizontal(wghtd_cols).alias("index"))
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id ┆ p1 ┆ p2 ┆ p3 ┆ p4 ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1 ┆ 44.3 ┆ 7.3 ┆ 70.3 ┆ 16.4 ┆ 342.041 │
│ 2 ┆ 2.3 ┆ 8.4 ┆ 80.4 ┆ 18.2 ┆ 33.448 │
│ 3 ┆ 2.4 ┆ 10.3 ┆ 100.3 ┆ 11.5 ┆ 37.681 │
│ 4 ┆ 6.2 ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘
This is particularly useful when one part of your code is generating the weights and/or selecting columns, and another part of your code is creating the resulting weighted sum column in the DataFrame.