In a Polars dataframe, I know that I can aggregate over a group of rows that have the same value in a column using, for example, .groupby("first_name").agg([...]).
How can I aggregate over all rows in a dataframe?
For example, I'd like to get the mean of all values in a column.
As suggested by @jqurious, you can use mean() directly on the column inside select or with_columns to obtain the mean over all rows; no group-by aggregation is needed.
Examples.
import polars as pl
# sample dataframe
df = pl.DataFrame({
    'text': ['a', 'a', 'b', 'b'],
    'value': [1, 2, 3, 4]
})
shape: (4, 2)
┌──────┬───────┐
│ text ┆ value │
│ ---  ┆ ---   │
│ str  ┆ i64   │
╞══════╪═══════╡
│ a    ┆ 1     │
│ a    ┆ 2     │
│ b    ┆ 3     │
│ b    ┆ 4     │
└──────┴───────┘
# compute the mean with select (returns only the new column)
df.select(
    value_mean = pl.mean('value')
)
shape: (1, 1)
┌────────────┐
│ value_mean │
│ ---        │
│ f64        │
╞════════════╡
│ 2.5        │
└────────────┘
# add the mean with with_columns
df.with_columns(
    value_mean = pl.mean('value')
)
shape: (4, 3)
┌──────┬───────┬────────────┐
│ text ┆ value ┆ value_mean │
│ ---  ┆ ---   ┆ ---        │
│ str  ┆ i64   ┆ f64        │
╞══════╪═══════╪════════════╡
│ a    ┆ 1     ┆ 2.5        │
│ a    ┆ 2     ┆ 2.5        │
│ b    ┆ 3     ┆ 2.5        │
│ b    ┆ 4     ┆ 2.5        │
└──────┴───────┴────────────┘
With select, only the columns specified in the select call show up in the result. With with_columns, all existing columns are kept, plus any column you add or modify.
Because the mean reduces the column to a single value, the result of select is one row, while with_columns repeats that value across all 4 rows of the sample dataframe.