Search code examples
python-polars

How to compute different statistics altogether using select/filter/group_by/value_counts in Polars


I have the following dataframe with three columns Cost, State and StartTime.

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "Cost": [1, 2, 3],
    "State": ["KO", "OK", "OK"],
    "StartTime": [datetime(2023, 6, 23), datetime(2024, 6, 22), datetime(2024, 6, 21)],
})
shape: (3, 3)
┌──────┬───────┬─────────────────────┐
│ Cost ┆ State ┆ StartTime           │
│ ---  ┆ ---   ┆ ---                 │
│ i64  ┆ str   ┆ datetime[μs]        │
╞══════╪═══════╪═════════════════════╡
│ 1    ┆ KO    ┆ 2023-06-23 00:00:00 │
│ 2    ┆ OK    ┆ 2024-06-22 00:00:00 │
│ 3    ┆ OK    ┆ 2024-06-21 00:00:00 │
└──────┴───────┴─────────────────────┘

I would like to compute:

  1. the mean and total cost - once for the entire dataframe and once for only the current month. I'd like the result to be stored in a single dataframe.
  2. The daily frequency (for instance by computing how many rows have the same date)
  3. Also counting the percentages of "State" equal to "OK" and equal to "KO".

How can this be achieved in python polars without computing each value separately, if possible? Thanks!

Here is an example with my attempt summary statistics.

this_year = datetime.now().year
this_month = datetime.now().month
filter_this_year_month_exp = (pl.col('StartTime').dt.year() == this_year) & (pl.col('StartTime').dt.month() == this_month)
df = df.with_columns(
    pl.col('StartTime').dt.date().alias('StartDate')
)
stats = df.select(
    pl.col("Cost").sum().alias("TotalCost"),
    pl.col("Cost").mean().alias("AverageCost"),
    pl.col("Cost").filter(filter_this_year_month_exp).sum().alias("TotalCostCurrentMonth"),
    pl.col("Cost").filter(filter_this_year_month_exp).mean().alias("AverageCostCurrentMonth"),
)
frequency_per_day_in_this_month = df.filter(
    filter_this_year_month_exp
).group_by('StartDate').len().select("len").mean().item()
# this returns just the counts, should be divided by len(df) to compute the percentage
df.select(
    pl.col("State").value_counts()
).unnest("State")

Solution

  • What you have so far is a great start. You can use window functions to calculate something over a group while staying in the select context. You can also use polars.len to get the current number of rows in the context.

    Starting from df = df.with_columns(pl.col('StartTime').dt.date().alias('StartDate')) onwards, this will perform all your desired computations in a single context (which is definitely preferred wherever possible).

    df.select(
        pl.col("Cost").sum().alias("TotalCost"),
        pl.col("Cost").mean().alias("AverageCost"),
        pl.col("Cost").filter(filter_this_year_month_exp).sum().alias("TotalCostCurrentMonth"),
        pl.col("Cost").filter(filter_this_year_month_exp).mean().alias("AverageCostCurrentMonth"),
        # Count rows per StartDate
        pl.len().over(pl.col("StartDate")).alias("StartDateFreq"),
        # Count rows per State divided by total rows
        (pl.len().over(pl.col("State")) / pl.len()).alias("StatePercentage"),
        # Count rows where State is "OK" divided by total rows
        (
            pl.col("State").filter(pl.col("State") == pl.lit("OK")).count() /
            pl.len()
        ).alias("PercentageStateIsOK")
    )
    
    shape: (3, 7)
    ┌───────────┬─────────────┬───────────────────────┬─────────────────────────┬───────────────┬─────────────────┬─────────────────────┐
    │ TotalCost ┆ AverageCost ┆ TotalCostCurrentMonth ┆ AverageCostCurrentMonth ┆ StartDateFreq ┆ StatePercentage ┆ PercentageStateIsOK │
    │ ---       ┆ ---         ┆ ---                   ┆ ---                     ┆ ---           ┆ ---             ┆ ---                 │
    │ i64       ┆ f64         ┆ i64                   ┆ f64                     ┆ u32           ┆ f64             ┆ f64                 │
    ╞═══════════╪═════════════╪═══════════════════════╪═════════════════════════╪═══════════════╪═════════════════╪═════════════════════╡
    │ 6         ┆ 2.0         ┆ 5                     ┆ 2.5                     ┆ 1             ┆ 0.333333        ┆ 0.666667            │
    │ 6         ┆ 2.0         ┆ 5                     ┆ 2.5                     ┆ 1             ┆ 0.666667        ┆ 0.666667            │
    │ 6         ┆ 2.0         ┆ 5                     ┆ 2.5                     ┆ 1             ┆ 0.666667        ┆ 0.666667            │
    └───────────┴─────────────┴───────────────────────┴─────────────────────────┴───────────────┴─────────────────┴─────────────────────┘
    

    Edit in response to comment:

    Here is an example that computes the percentage of rows for the current month where the state is "OK"

    (
        # Count of rows where State is "OK" AND it is the current month divided by
        pl.col("State").filter(pl.col("State") == pl.lit("OK"), filter_this_year_month_exp).count() /
        # Count of rows for the current month
        pl.col("StartTime").filter(filter_this_year_month_exp).count()
    ).alias("PercentageStateIsOKThisMonth")
    

    And here is a section of the User Guide explaining contexts. DataFrame methods such as select, with_columns, filter, group_by, etc. are all contexts.

    And by way of example for pl.len

    df = pl.DataFrame({"a": [1, 2, 3]})
    print(    
        df.with_columns(first_context=pl.len())
        .filter(pl.col("a") > 1)
        # pl.len() now knows there are only 2 rows remaining
        .select("first_context", second_context=pl.len())
    )
    # shape: (2, 2)
    # ┌───────────────┬────────────────┐
    # │ first_context ┆ second_context │
    # │ ---           ┆ ---            │
    # │ u32           ┆ u32            │
    # ╞═══════════════╪════════════════╡
    # │ 3             ┆ 2              │
    # │ 3             ┆ 2              │
    # └───────────────┴────────────────┘
    
    # don't do this!
    df = (        
        df.with_columns(first_context=df.height)
        .filter(pl.col("a") > 1)
        # df.height stays the same as `df` has not yet been assigned back to
        .select("first_context", second_context=df.height)
    )
    print(df)
    # shape: (2, 2)
    # ┌───────────────┬────────────────┐
    # │ first_context ┆ second_context │
    # │ ---           ┆ ---            │
    # │ i32           ┆ i32            │
    # ╞═══════════════╪════════════════╡
    # │ 3             ┆ 3              │
    # │ 3             ┆ 3              │
    # └───────────────┴────────────────┘