Weighted sum of a column in Polars dataframe


I have a Polars dataframe and I want to calculate a weighted sum of a particular column, where the weights are just the positive integer sequence 1, 2, 3, ...

For example, assume I have the following dataframe.

import polars as pl

df = pl.DataFrame({"a": [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]})

The result I want is

218 (= 2*1 + 4*2 + 2*3 + 1*4 + ... + 7*9 + 5*10)
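
As a sanity check on that expected value, the same arithmetic can be written in plain Python (no Polars involved). The names `values` and `expected` are only illustrative:

```python
# Plain-Python check of the expected result; `values` mirrors column "a".
values = [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]

# Weights are the 1-based positions 1, 2, 3, ..., len(values).
expected = sum(weight * value for weight, value in enumerate(values, start=1))

print(expected)  # 218
```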

How can I achieve this using only general Polars expressions? (I want to stick to Polars expressions for speed.)

Note: This is a simple example with only 10 numbers; in general, the dataframe height can be any positive number.

Thanks for your help.


Solution

  • A weighted sum like this can be calculated as a dot product (the .dot() method). To generate the weights from 1 to n, you can use pl.int_range(1, n+1); inside an expression, pl.len() supplies n.

    If you just need the value of the weighted sum:

    df.select(
        pl.col("a").dot(pl.int_range(1, pl.len()+1))
    )  # append .item() to get the scalar value (218)
    

    To keep the dataframe and add the result as a column:

    df.with_columns(
        pl.col("a").dot(pl.int_range(1, pl.len()+1)).alias("weighted_sum")
    )
    
    ┌─────┬──────────────┐
    │ a   ┆ weighted_sum │
    │ --- ┆ ---          │
    │ i64 ┆ i64          │
    ╞═════╪══════════════╡
    │ 2   ┆ 218          │
    │ 4   ┆ 218          │
    │ ... ┆ ...          │
    │ 7   ┆ 218          │
    │ 5   ┆ 218          │
    └─────┴──────────────┘
    

    In a group_by context (some_cat_col stands for any grouping column):

    df.group_by("some_cat_col", maintain_order=True).agg(
        pl.col("a").dot(pl.int_range(1, pl.len()+1))
    )