Tags: python, dataframe, python-polars

How to Calculate Z-Scores for a List of Values in a Polars DataFrame


I'm working with a Polars DataFrame in Python, where I have a column containing lists of values. I need to calculate the Z-scores for each list using pre-computed mean and standard deviation values. Here’s a sample of my DataFrame:

import polars as pl

data = {
    "transcript_id": ["ENST00000711184.1"],
    "OE": [[3.933402, 1.057907, None, 3.116513]],
    "mean_OE": [11.882091],
    "std_OE": [3.889974],
}

df_human = pl.DataFrame(data)

For each list in the OE column, I want to subtract the mean (mean_OE) and divide by the standard deviation (std_OE) to obtain the Z-scores. I also want to handle None values in the lists by leaving them as None in the Z-scores list.

How can I correctly apply the Z-score calculation to each list while keeping None values intact?

Thanks in advance for any guidance!


Solution

  • Over the last few releases, and especially since Polars 1.10.0, arithmetic between list columns and non-list columns has become much simpler.

    If you want the usual definition of the Z-score (using each list's own mean and standard deviation), you can use the following.

    df_human.select(
        (pl.col("OE") - pl.col("OE").list.mean()) / pl.col("OE").list.std()
    )
    
    shape: (1, 1)
    ┌───────────────────────────────────────┐
    │ OE                                    │
    │ ---                                   │
    │ list[f64]                             │
    ╞═══════════════════════════════════════╡
    │ [0.830631, -1.109966, null, 0.279334] │
    └───────────────────────────────────────┘
    

    If you want to compute the Z-score explicitly using the mean_OE and std_OE columns, you can now use them directly.

    df_human.select((pl.col("OE") - pl.col("mean_OE")) / pl.col("std_OE"))
    
    shape: (1, 1)
    ┌─────────────────────────────────────────┐
    │ OE                                      │
    │ ---                                     │
    │ list[f64]                               │
    ╞═════════════════════════════════════════╡
    │ [-2.043378, -2.782585, null, -2.253377] │
    └─────────────────────────────────────────┘