Search code examples
pythonpython-polars

polars: compute row-wise quantile over DataFrame


I have some polars DataFrames over which I want to compute some row-wise statistics.

For some there is a .list.func function which exists (eg list.mean), however, for those which don't have a dedicated function I believe I must use list.eval.

For the following example data:

df = pl.DataFrame({
    'a': [1,10,1,.1,.1,     np.NAN],
    'b': [2, 8,1,.2, np.NAN,np.NAN],
    'c': [3, 6,2,.3,.2,     np.NAN],
    'd': [4, 4,3,.4, np.NAN,np.NAN],
    'e': [5, 2,3,.5,.3,     np.NAN],
}, strict=False)

I have managed to come up with the following expression.

It seems that list.eval returns a list (which I suppose is more generic) so I need to call .explode on the resulting 1-element list to get back a single value.

The resulting column takes the name of the first column, so I then need to call .alias to give it a more meaningful name.

df.select(
    pl.concat_list(
        pl.all().fill_nan(None)
    )
    .list.eval(pl.element().quantile(0.25))
    .explode()
    .alias('q1')
)

Is this the recommended way of computing row-wise?


Solution

  • I would unpivot and join here. It should be faster than .list.eval plus it let's you more easily add other row wise aggregations. Note I've added q2,q3,q4 to the agg

    (
        (_df:=df.with_row_index('i'))
        .join(
            _df
            .unpivot(index='i')
            .group_by('i')
            .agg(
                pl.col('value').quantile(x).alias(q)
                for q,x in {'q1':0.25,'q2':0.50, 'q3':0.75, 'q4':1}.items()
                ), 
            on='i'
            )
        .sort('i')
        .drop('i')
        )
    
    shape: (6, 9)
    ┌──────┬─────┬─────┬─────┬───┬─────┬─────┬─────┬──────┐
    │ a    ┆ b   ┆ c   ┆ d   ┆ … ┆ q1  ┆ q2  ┆ q3  ┆ q4   │
    │ ---  ┆ --- ┆ --- ┆ --- ┆   ┆ --- ┆ --- ┆ --- ┆ ---  │
    │ f64  ┆ f64 ┆ f64 ┆ f64 ┆   ┆ f64 ┆ f64 ┆ f64 ┆ f64  │
    ╞══════╪═════╪═════╪═════╪═══╪═════╪═════╪═════╪══════╡
    │ 1.0  ┆ 2.0 ┆ 3.0 ┆ 4.0 ┆ … ┆ 2.0 ┆ 3.0 ┆ 4.0 ┆ 5.0  │
    │ 10.0 ┆ 8.0 ┆ 6.0 ┆ 4.0 ┆ … ┆ 4.0 ┆ 6.0 ┆ 8.0 ┆ 10.0 │
    │ 1.0  ┆ 1.0 ┆ 2.0 ┆ 3.0 ┆ … ┆ 1.0 ┆ 2.0 ┆ 3.0 ┆ 3.0  │
    │ 0.1  ┆ 0.2 ┆ 0.3 ┆ 0.4 ┆ … ┆ 0.2 ┆ 0.3 ┆ 0.4 ┆ 0.5  │
    │ 0.1  ┆ NaN ┆ 0.2 ┆ NaN ┆ … ┆ 0.2 ┆ 0.3 ┆ NaN ┆ NaN  │
    │ NaN  ┆ NaN ┆ NaN ┆ NaN ┆ … ┆ NaN ┆ NaN ┆ NaN ┆ NaN  │
    └──────┴─────┴─────┴─────┴───┴─────┴─────┴─────┴──────┘
    

    I used the walrus operator to create _df so as to not have to invoke .with_row_index twice. If you prefer you can just do df=df.with_row_index('i') first instead.