I have some polars DataFrames over which I want to compute some row-wise statistics.
For some statistics there is a dedicated .list.func function (e.g. list.mean); however, for those which don't have a dedicated function I believe I must use list.eval.
For the following example data:
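import numpy as np
import polars as pl

# np.nan is used instead of np.NAN, which was removed in NumPy 2.0
df = pl.DataFrame({
    'a': [1, 10, 1, .1, .1, np.nan],
    'b': [2, 8, 1, .2, np.nan, np.nan],
    'c': [3, 6, 2, .3, .2, np.nan],
    'd': [4, 4, 3, .4, np.nan, np.nan],
    'e': [5, 2, 3, .5, .3, np.nan],
}, strict=False)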
I have managed to come up with the following expression.
It seems that list.eval returns a list (which I suppose is more generic), so I need to call .explode on the resulting 1-element list to get back a single value.
The resulting column takes the name of the first column, so I then need to call .alias to give it a more meaningful name.
df.select(
    pl.concat_list(
        pl.all().fill_nan(None)
    )
    .list.eval(pl.element().quantile(0.25))
    .explode()
    .alias('q1')
)
Is this the recommended way of computing row-wise statistics?
I would unpivot and join here. It should be faster than .list.eval, and it lets you add other row-wise aggregations more easily. Note that I've added q2, q3 and q4 to the agg.
(
    (_df := df.with_row_index('i'))
    .join(
        _df
        .unpivot(index='i')
        .group_by('i')
        .agg(
            pl.col('value').quantile(x).alias(q)
            for q, x in {'q1': 0.25, 'q2': 0.50, 'q3': 0.75, 'q4': 1}.items()
        ),
        on='i'
    )
    .sort('i')
    .drop('i')
)
shape: (6, 9)
┌──────┬─────┬─────┬─────┬───┬─────┬─────┬─────┬──────┐
│ a    ┆ b   ┆ c   ┆ d   ┆ … ┆ q1  ┆ q2  ┆ q3  ┆ q4   │
│ ---  ┆ --- ┆ --- ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ ---  │
│ f64  ┆ f64 ┆ f64 ┆ f64 ┆   ┆ f64 ┆ f64 ┆ f64 ┆ f64  │
╞══════╪═════╪═════╪═════╪═══╪═════╪═════╪═════╪══════╡
│ 1.0  ┆ 2.0 ┆ 3.0 ┆ 4.0 ┆ … ┆ 2.0 ┆ 3.0 ┆ 4.0 ┆ 5.0  │
│ 10.0 ┆ 8.0 ┆ 6.0 ┆ 4.0 ┆ … ┆ 4.0 ┆ 6.0 ┆ 8.0 ┆ 10.0 │
│ 1.0  ┆ 1.0 ┆ 2.0 ┆ 3.0 ┆ … ┆ 1.0 ┆ 2.0 ┆ 3.0 ┆ 3.0  │
│ 0.1  ┆ 0.2 ┆ 0.3 ┆ 0.4 ┆ … ┆ 0.2 ┆ 0.3 ┆ 0.4 ┆ 0.5  │
│ 0.1  ┆ NaN ┆ 0.2 ┆ NaN ┆ … ┆ 0.2 ┆ 0.3 ┆ NaN ┆ NaN  │
│ NaN  ┆ NaN ┆ NaN ┆ NaN ┆ … ┆ NaN ┆ NaN ┆ NaN ┆ NaN  │
└──────┴─────┴─────┴─────┴───┴─────┴─────┴─────┴──────┘
I used the walrus operator to create _df so as not to have to invoke .with_row_index twice. If you prefer, you can just do df = df.with_row_index('i') first instead.