I have the following dataframe with three columns Cost, State and StartTime.
import polars as pl
from datetime import datetime
df = pl.DataFrame({
"Cost": [1, 2, 3],
"State": ["KO", "OK", "OK"],
"StartTime": [datetime(2023, 6, 23), datetime(2024, 6, 22), datetime(2024, 6, 21)],
})
shape: (3, 3)
┌──────┬───────┬─────────────────────┐
│ Cost ┆ State ┆ StartTime │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ datetime[μs] │
╞══════╪═══════╪═════════════════════╡
│ 1 ┆ KO ┆ 2023-06-23 00:00:00 │
│ 2 ┆ OK ┆ 2024-06-22 00:00:00 │
│ 3 ┆ OK ┆ 2024-06-21 00:00:00 │
└──────┴───────┴─────────────────────┘
I would like to compute:
How can this be achieved in python polars without computing each value separately, if possible? Thanks!
Here is an example with my attempt summary statistics.
this_year = datetime.now().year
this_month = datetime.now().month
filter_this_year_month_exp = (pl.col('StartTime').dt.year() == this_year) & (pl.col('StartTime').dt.month() == this_month)
df = df.with_columns(
pl.col('StartTime').dt.date().alias('StartDate')
)
stats = df.select(
pl.col("Cost").sum().alias("TotalCost"),
pl.col("Cost").mean().alias("AverageCost"),
pl.col("Cost").filter(filter_this_year_month_exp).sum().alias("TotalCostCurrentMonth"),
pl.col("Cost").filter(filter_this_year_month_exp).mean().alias("AverageCostCurrentMonth"),
)
frequency_per_day_in_this_month = df.filter(
filter_this_year_month_exp
).group_by('StartDate').len().select("len").mean().item()
# this returns just the counts, should be divided by len(df) to compute the percentage
df.select(
pl.col("State").value_counts()
).unnest("State")
What you have so far is a great start. You can use window functions to calculate something over a group while staying in the select
context. You can also use polars.len to get the current number of rows in the context.
Starting from df = df.with_columns(pl.col('StartTime').dt.date().alias('StartDate'))
onwards, this will perform all your desired computations in a single context (which is definitely preferred wherever possible).
df.select(
pl.col("Cost").sum().alias("TotalCost"),
pl.col("Cost").mean().alias("AverageCost"),
pl.col("Cost").filter(filter_this_year_month_exp).sum().alias("TotalCostCurrentMonth"),
pl.col("Cost").filter(filter_this_year_month_exp).mean().alias("AverageCostCurrentMonth"),
# Count rows per StartDate
pl.len().over(pl.col("StartDate")).alias("StartDateFreq"),
# Count rows per State divided by total rows
(pl.len().over(pl.col("State")) / pl.len()).alias("StatePercentage"),
# Count rows where State is "OK" divided by total rows
(
pl.col("State").filter(pl.col("State") == pl.lit("OK")).count() /
pl.len()
).alias("PercentageStateIsOK")
)
shape: (3, 7)
┌───────────┬─────────────┬───────────────────────┬─────────────────────────┬───────────────┬─────────────────┬─────────────────────┐
│ TotalCost ┆ AverageCost ┆ TotalCostCurrentMonth ┆ AverageCostCurrentMonth ┆ StartDateFreq ┆ StatePercentage ┆ PercentageStateIsOK │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ i64 ┆ f64 ┆ u32 ┆ f64 ┆ f64 │
╞═══════════╪═════════════╪═══════════════════════╪═════════════════════════╪═══════════════╪═════════════════╪═════════════════════╡
│ 6 ┆ 2.0 ┆ 5 ┆ 2.5 ┆ 1 ┆ 0.333333 ┆ 0.666667 │
│ 6 ┆ 2.0 ┆ 5 ┆ 2.5 ┆ 1 ┆ 0.666667 ┆ 0.666667 │
│ 6 ┆ 2.0 ┆ 5 ┆ 2.5 ┆ 1 ┆ 0.666667 ┆ 0.666667 │
└───────────┴─────────────┴───────────────────────┴─────────────────────────┴───────────────┴─────────────────┴─────────────────────┘
Edit in response to comment:
Here is an example that computes the percentage of rows for the current month where the state is "OK"
(
# Count of rows where State is "OK" AND it is the current month divided by
pl.col("State").filter(pl.col("State") == pl.lit("OK"), filter_this_year_month_exp).count() /
# Count of rows for the current month
pl.col("StartTime").filter(filter_this_year_month_exp).count()
).alias("PercentageStateIsOKThisMonth")
And here is a section of the User Guide explaining contexts. DataFrame methods such as select
, with_columns
, filter
, group_by
, etc. are all contexts.
And by way of example for pl.len
df = pl.DataFrame({"a": [1, 2, 3]})
print(
df.with_columns(first_context=pl.len())
.filter(pl.col("a") > 1)
# pl.len() now knows there are only 2 rows remaining
.select("first_context", second_context=pl.len())
)
# shape: (2, 2)
# ┌───────────────┬────────────────┐
# │ first_context ┆ second_context │
# │ --- ┆ --- │
# │ u32 ┆ u32 │
# ╞═══════════════╪════════════════╡
# │ 3 ┆ 2 │
# │ 3 ┆ 2 │
# └───────────────┴────────────────┘
# don't do this!
df = (
df.with_columns(first_context=df.height)
.filter(pl.col("a") > 1)
# df.height stays the same as `df` has not yet been assigned back to
.select("first_context", second_context=df.height)
)
print(df)
# shape: (2, 2)
# ┌───────────────┬────────────────┐
# │ first_context ┆ second_context │
# │ --- ┆ --- │
# │ i32 ┆ i32 │
# ╞═══════════════╪════════════════╡
# │ 3 ┆ 3 │
# │ 3 ┆ 3 │
# └───────────────┴────────────────┘