Python Polars groupby variance


I would like to compute the per-group variance of my Polars DataFrame. Maybe the reason is obvious, but I don't know why var() does not exist in the group-by object namespace. Is there a workaround?

df.group_by("group_id", maintain_order=True).var()


Solution

  • You can use pl.all() inside agg to compute any statistic you need for every non-key column of each group. For example:

    import polars as pl
    import numpy as np
    
    nbr_rows_per_group = 1_000
    nbr_groups = 3
    
    rng = np.random.default_rng(1)
    
    df = pl.DataFrame(
        {
            "group" : list(range(0, nbr_groups)) * nbr_rows_per_group,
            "col1": rng.normal(0, 1, nbr_groups * nbr_rows_per_group),
            "col2": rng.normal(0, 1, nbr_groups * nbr_rows_per_group),
        }
    )
    
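    # Without maintain_order=True, the row order of the groups is not guaranteed.
    (
        df
        .group_by('group')
        .agg(
            # Inside agg, pl.all() selects every column except the group key.
            pl.all().var().name.suffix('_var'),
            pl.all().mean().name.suffix('_mean'),
            pl.all().skew().name.suffix('_skew'),
        )
    )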
    
    shape: (3, 7)
    ┌───────┬──────────┬──────────┬───────────┬───────────┬───────────┬───────────┐
    │ group ┆ col1_var ┆ col2_var ┆ col1_mean ┆ col2_mean ┆ col1_skew ┆ col2_skew │
    │ ---   ┆ ---      ┆ ---      ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
    │ i64   ┆ f64      ┆ f64      ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
    ╞═══════╪══════════╪══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
    │ 0     ┆ 0.999802 ┆ 0.99401  ┆ 0.017574  ┆ 0.021156  ┆ -0.042408 ┆ 0.0102    │
    │ 2     ┆ 1.031637 ┆ 1.029593 ┆ -0.053874 ┆ -0.037097 ┆ 0.004183  ┆ 0.080086  │
    │ 1     ┆ 0.941347 ┆ 1.006852 ┆ 0.029232  ┆ -0.023855 ┆ 0.049269  ┆ 0.074515  │
    └───────┴──────────┴──────────┴───────────┴───────────┴───────────┴───────────┘