Search code examples
pythonpython-polars

How to average lists on different columns using polars LazyFrame


I have a polars LazyFrame which has 3 columns of type nullable list[f64], something like this.

import polars as pl

lf = pl.DataFrame({
    "1": [
        [0.0, 1.1, 2.2],
        [0.0, 1.1, 2.2],
        [0.0, 1.1, 2.2],
        None,
    ],
    "2": [
        [0.3, 1.3, 2.3],
        [0.4, 1.4, 2.4],
        [0.5, 1.5, 2.5],
        None,
    ],
    "3": [
        [0.7, 1.7, 2.7],
        None,
        [0.9, 1.9, 2.9],
        None,
    ],
}).lazy()
┌─────────────────┬─────────────────┬─────────────────┐
│ 1               ┆ 2               ┆ 3               │
│ ---             ┆ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╪═════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] │
│ null            ┆ null            ┆ null            │
└─────────────────┴─────────────────┴─────────────────┘

I need to add a column with the average of the three column's lists, furthermore

  • when in a row there are only null, then avg will be a zeros-filled list of fixed length 3
  • when one item is null the average will be computed on the non-null lists

For "average of the lists" I mean the element-wise sum divided by the number of lists involved in the sum.

So in the first row I want:

[
    0.0 + 0.3 + 0.7,
    1.1 + 1.3 + 1.7,
    2.2 + 2.3 + 2.7
] / 3
=
[
    1.0, 4.1, 7.2
] / 3
=
[0.33, 1.36, 2.40]

In the second row: [0.0 + 0.4, 1.1 + 1.4, 2.2 + 2.4] / 2 = [0.4, 2.8, 4.6] / 2 = [0.2, 1.4, 2.3].

In the last row: [0.0, 0.0, 0.0].

I found a way to sum the columns

lf.select(
    pl.sum_horizontal(
        pl.all().list.explode()
    ).reshape((1, -1)).alias("sum"),
).collect()

But this only works when all the items in row are non-null.


Solution

  • Update: As of Polars 1.10.0 literals/scalars are broadcasted for the List type.

    df.with_columns(avg = pl.sum_horizontal(pl.all().fill_null([0] * 3)) / 3)
    
    shape: (4, 4)
    ┌─────────────────┬─────────────────┬─────────────────┬────────────────────────────────┐
    │ 1               ┆ 2               ┆ 3               ┆ avg                            │
    │ ---             ┆ ---             ┆ ---             ┆ ---                            │
    │ list[f64]       ┆ list[f64]       ┆ list[f64]       ┆ list[f64]                      │
    ╞═════════════════╪═════════════════╪═════════════════╪════════════════════════════════╡
    │ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4]      │
    │ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            ┆ [0.133333, 0.833333, 1.533333] │
    │ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333]      │
    │ null            ┆ null            ┆ null            ┆ [0.0, 0.0, 0.0]                │
    └─────────────────┴─────────────────┴─────────────────┴────────────────────────────────┘
    

    Update: As of Polars 1.8.0 list arithmetic is now supported.

    df.select(pl.sum_horizontal(pl.all().fill_null([0] * 3)))
    
    shape: (4, 1)
    ┌─────────────────┐
    │ 1               │
    │ ---             │
    │ list[f64]       │
    ╞═════════════════╡
    │ [1.0, 4.1, 7.2] │
    │ [0.4, 2.5, 4.6] │
    │ [1.4, 4.5, 7.6] │
    │ [0.0, 0.0, 0.0] │
    └─────────────────┘
    

    Follow on work will add support for broadcasting of literals and scalars (i.e. / 3)

    You can add pl.lit([3, 3, 3]) as a column for now.

    Broadcasting of literals for the Array type was also added in 1.8.0

    width = 3
    dtype = pl.Array(float, width)
    
    (df.cast(dtype)
       .with_columns( 
          pl.sum_horizontal(pl.all().fill_null(pl.lit([0] * width, dtype)))
            .alias("avg")
          / 
          pl.lit([width] * width, dtype)
       )
    )
    

    Original answer:

    Perhaps you could expand on your initial attempt by replacing null with [0, 0, 0] and manually creating a list of [width, width, width] to divide by.

    width = pl.sum_horizontal(pl.all().is_not_null()) 
    
    df.with_columns(avg = 
       (pl.sum_horizontal(
          pl.all().fill_null([0, 0, 0]).list.explode()
       ) / pl.concat_list(width, width, width).list.explode())
       .fill_nan(0)
       .reshape((-1, 3))
    )
    
    shape: (4, 4)
    ┌─────────────────┬─────────────────┬─────────────────┬───────────────────────────┐
    │ 1               ┆ 2               ┆ 3               ┆ avg                       │
    │ ---             ┆ ---             ┆ ---             ┆ ---                       │
    │ list[f64]       ┆ list[f64]       ┆ list[f64]       ┆ list[f64]                 │
    ╞═════════════════╪═════════════════╪═════════════════╪═══════════════════════════╡
    │ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4] │
    │ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null            ┆ [0.2, 1.25, 2.3]          │
    │ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333] │
    │ null            ┆ null            ┆ null            ┆ [0.0, 0.0, 0.0]           │
    └─────────────────┴─────────────────┴─────────────────┴───────────────────────────┘