I'd like to do the following in python using the polars library:
Input:
df = pl.from_repr("""
┌──────┬────────┐
│ Name ┆ Number │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪════════╡
│ Mr.A ┆ 1 │
│ Mr.A ┆ 4 │
│ Mr.A ┆ 5 │
│ Mr.B ┆ 3 │
│ Mr.B ┆ 5 │
│ Mr.B ┆ 6 │
│ Mr.B ┆ 10 │
└──────┴────────┘
""")
Output:
shape: (7, 3)
┌──────┬────────┬──────────┐
│ Name ┆ Number ┆ average │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞══════╪════════╪══════════╡
│ Mr.A ┆ 1 ┆ 0.0 │
│ Mr.A ┆ 4 ┆ 1.0 │
│ Mr.A ┆ 5 ┆ 2.5 │
│ Mr.B ┆ 3 ┆ 0.0 │
│ Mr.B ┆ 5 ┆ 3.0 │
│ Mr.B ┆ 6 ┆ 4.0 │
│ Mr.B ┆ 10 ┆ 4.666667 │
└──────┴────────┴──────────┘
That is to say:
Example:
Mr. A started off with average=0 and the Number=1.
Then, Mr. A has the Number=4, thus it took the average of the previous entry (1/1 data=1)
Then, Mr. A has the Number=5, thus the previous average was: (1+4) / (2 data) = 5/2 = 2.5
And so on
I've tried the rolling mean function (using a Polars Dataframe, df), however, I'm restricted by rolling_mean's window size (i.e. it calculates only the past 2 entries, plus it averages the current entry as well; I want to average only the previous entries)
Does anyone have an idea? Much appreciated!:
df.group_by("Name").agg(pl.col("Number").rolling_mean(window_size=2))
cum_sum()
and cum_count()
to calculate cumulative average.shift()
so it's calculated for 'previous' row.over()
to do the whole operation within Name
.df.with_columns(
(pl.col.Number.cum_sum() / pl.col.Number.cum_count())
.shift(1, fill_value=0)
.over("Name")
.alias("average")
)
┌──────┬────────┬──────────┐
│ Name ┆ Number ┆ average │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞══════╪════════╪══════════╡
│ Mr.A ┆ 1 ┆ 0.0 │
│ Mr.A ┆ 4 ┆ 1.0 │
│ Mr.A ┆ 5 ┆ 2.5 │
│ Mr.B ┆ 3 ┆ 0.0 │
│ Mr.B ┆ 5 ┆ 3.0 │
│ Mr.B ┆ 6 ┆ 4.0 │
│ Mr.B ┆ 10 ┆ 4.666667 │
└──────┴────────┴──────────┘