I'm working with a Polars DataFrame in Python, where I have a column containing lists of values. I need to calculate the Z-scores for each list using pre-computed mean and standard deviation values. Here’s a sample of my DataFrame:
import polars as pl
data = {
    "transcript_id": ["ENST00000711184.1"],
    "OE": [[3.933402, 1.057907, None, 3.116513]],
    "mean_OE": [11.882091],
    "std_OE": [3.889974],
}
df_human = pl.DataFrame(data)
For each list in the OE column, I want to subtract the mean (mean_OE) and divide by the standard deviation (std_OE) to obtain the Z-scores. I also want to handle None values in the lists by leaving them as None in the Z-scores list.
How can I correctly apply the Z-score calculation to each list while keeping None values intact?
Thanks in advance for any guidance!
Over the last few releases, and especially since Polars 1.10.0, arithmetic between list columns and non-list columns has become much simpler.
If you are interested in the usual definition of the Z-score (using the summary statistics of the actual list data), the following can be used.
df_human.select(
(pl.col("OE") - pl.col("OE").list.mean()) / pl.col("OE").list.std()
)
shape: (1, 1)
┌───────────────────────────────────────┐
│ OE                                    │
│ ---                                   │
│ list[f64]                             │
╞═══════════════════════════════════════╡
│ [0.830631, -1.109966, null, 0.279334] │
└───────────────────────────────────────┘
If you want to compute the Z-score explicitly using the mean_OE and std_OE columns, you can now use them directly.
df_human.select((pl.col("OE") - pl.col("mean_OE")) / pl.col("std_OE"))
shape: (1, 1)
┌─────────────────────────────────────────┐
│ OE                                      │
│ ---                                     │
│ list[f64]                               │
╞═════════════════════════════════════════╡
│ [-2.043378, -2.782585, null, -2.253377] │
└─────────────────────────────────────────┘