Search code examples
pythonaggregationpython-polarsstatistical-mode

How to calculate mode when using python-polars in aggregation


I am involving in a data-mining project and have some problems whiling doing feature engineering. One of my goal is to aggregate data according to the primary key, and to produce new columns. So I write this:

df = df.group_by("case_id").agg(date_exprs(df,df_base))

def date_expr(df, df_base):
    # Join df and df_base on 'case_id' column
    df = df.join(df_base[['case_id','date_decision']], on="case_id", how="left")

    for col in df.columns:
        if col[-1] in ("D",):
            df = df.with_columns(pl.col(col) - pl.col("date_decision"))
            df = df.with_columns(pl.col(col).dt.total_days())

    cols = [col for col in df.columns if col[-1] in ("D",)]

    # Generate expressions for max, min, mean, mode, and std of date differences
    expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
    expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
    expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
    expr_mode = [pl.mode(col).alias(f"mode_{col}") for col in cols]
    expr_std = [pl.std(col).alias(f"std_{col}") for col in cols]

    return expr_max + expr_min + expr_mean + expr_mode + expr_std

However, there goes an error: AttributeError: module 'polars' has no attribute 'mode'.

I looked up document of polars on github and found there was no Dataframe.mode() but Series.mode(), which I thought might be the reason of error? I referred to chatGPT, which could not help because these codes with error were just from it.

Besides, here is only an example of dealing with float type. What about string type? Can I also apply your method?

I am looking forward to your kind help!!


Solution

  • In your example it fails because there's no syntactic sugar for Expr.mode() as it is for aggregate functions (for example, pl.max() is a syntactic sugar for Expr.max(). The mode() is actually not aggregation function but computation one, which means it just calculates the most occuring value(s) within the column.

    So, given the DataFrame like this:

    df = (
        pl.DataFrame({
            'aD' : [200, 200, 300, 400, 1, 3],
            'bD': [2, 3, 6, 4, 5, 1],
            'case_id': [1,1,1,2,2,2]
        })
    )
    
    ┌─────┬─────┬─────────┐
    │ aD  ┆ bD  ┆ case_id │
    │ --- ┆ --- ┆ ---     │
    │ i64 ┆ i64 ┆ i64     │
    ╞═════╪═════╪═════════╡
    │ 200 ┆ 2   ┆ 1       │
    │ 200 ┆ 3   ┆ 1       │
    │ 300 ┆ 6   ┆ 1       │
    │ 400 ┆ 4   ┆ 2       │
    │ 1   ┆ 5   ┆ 2       │
    │ 3   ┆ 1   ┆ 2       │
    └─────┴─────┴─────────┘
    

    you can calculate mode() with the following code:

    df.with_columns(
        pl.col('aD').mode(),
        pl.col('bD').mode()
    )
    
    ┌─────┬─────┬─────────┐
    │ aD  ┆ bD  ┆ case_id │
    │ --- ┆ --- ┆ ---     │
    │ i64 ┆ i64 ┆ i64     │
    ╞═════╪═════╪═════════╡
    │ 200 ┆ 1   ┆ 1       │
    │ 200 ┆ 5   ┆ 1       │
    │ 200 ┆ 6   ┆ 1       │
    │ 200 ┆ 4   ┆ 2       │
    │ 200 ┆ 2   ┆ 2       │
    │ 200 ┆ 3   ┆ 2       │
    └─────┴─────┴─────────┘
    

    given that, we still can calculate the results you need. I'll simplify you function a bit by using selectors and Expr.prefix():

    import polars.selectors as cs
    
    def date_expr():
        # Generate expressions for max, min, mean, mode, and std of date differences
        expr_max = cs.ends_with('D').max().name.prefix("max_")
        expr_min = cs.ends_with('D').min().name.prefix("min_")
        expr_mean = cs.ends_with('D').mean().name.prefix("mean_")
        expr_mode = cs.ends_with('D').mode().first().name.prefix("mode_")
        expr_std = cs.ends_with('D').std().name.prefix("std_")
    
        return expr_max, expr_min, expr_mean, expr_std, expr_mode
    
    df.group_by("case_id").agg(date_expr())
    
    ┌─────────┬────────┬────────┬────────┬───┬────────────┬──────────┬─────────┬─────────┐
    │ case_id ┆ max_aD ┆ max_bD ┆ min_aD ┆ … ┆ std_aD     ┆ std_bD   ┆ mode_aD ┆ mode_bD │
    │ ---     ┆ ---    ┆ ---    ┆ ---    ┆   ┆ ---        ┆ ---      ┆ ---     ┆ ---     │
    │ i64     ┆ i64    ┆ i64    ┆ i64    ┆   ┆ f64        ┆ f64      ┆ i64     ┆ i64     │
    ╞═════════╪════════╪════════╪════════╪═══╪════════════╪══════════╪═════════╪═════════╡
    │ 2       ┆ 400    ┆ 5      ┆ 1      ┆ … ┆ 229.787583 ┆ 2.081666 ┆ 3       ┆ 4       │
    │ 1       ┆ 300    ┆ 6      ┆ 200    ┆ … ┆ 57.735027  ┆ 2.081666 ┆ 200     ┆ 2       │
    └─────────┴────────┴────────┴────────┴───┴────────────┴──────────┴─────────┴─────────┘
    

    note that I've used Expr.first() to the one of the values for mode - as there might be different ones with the same frequency. You can use list expressions to specify which one you'd like to get.