I have a polars LazyFrame
which has 3 columns of type nullable list[f64]
, something like this.
import polars as pl
lf = pl.DataFrame({
"1": [
[0.0, 1.1, 2.2],
[0.0, 1.1, 2.2],
[0.0, 1.1, 2.2],
None,
],
"2": [
[0.3, 1.3, 2.3],
[0.4, 1.4, 2.4],
[0.5, 1.5, 2.5],
None,
],
"3": [
[0.7, 1.7, 2.7],
None,
[0.9, 1.9, 2.9],
None,
],
}).lazy()
┌─────────────────┬─────────────────┬─────────────────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╪═════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] │
│ null ┆ null ┆ null │
└─────────────────┴─────────────────┴─────────────────┘
I need to add a column with the average of the three column's lists, furthermore
null
, then avg will be a zeros-filled list of fixed length 3
null
the average will be computed on the non-null listsFor "average of the lists" I mean the element-wise sum divided by the number of lists involved in the sum.
So in the first row I want:
[
0.0 + 0.3 + 0.7,
1.1 + 1.3 + 1.7,
2.2 + 2.3 + 2.7
] / 3
=
[
1.0, 4.1, 7.2
] / 3
=
[0.33, 1.36, 2.40]
In the second row: [0.0 + 0.4, 1.1 + 1.4, 2.2 + 2.4] / 2 = [0.4, 2.8, 4.6] / 2 = [0.2, 1.4, 2.3]
.
In the last row: [0.0, 0.0, 0.0]
.
I found a way to sum the columns
lf.select(
pl.sum_horizontal(
pl.all().list.explode()
).reshape((1, -1)).alias("sum"),
).collect()
But this only works when all the items in row are non-null.
Update: As of Polars 1.10.0 literals/scalars are broadcasted for the List type.
df.with_columns(avg = pl.sum_horizontal(pl.all().fill_null([0] * 3)) / 3)
shape: (4, 4)
┌─────────────────┬─────────────────┬─────────────────┬────────────────────────────────┐
│ 1 ┆ 2 ┆ 3 ┆ avg │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╪═════════════════╪════════════════════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null ┆ [0.133333, 0.833333, 1.533333] │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333] │
│ null ┆ null ┆ null ┆ [0.0, 0.0, 0.0] │
└─────────────────┴─────────────────┴─────────────────┴────────────────────────────────┘
Update: As of Polars 1.8.0 list arithmetic is now supported.
df.select(pl.sum_horizontal(pl.all().fill_null([0] * 3)))
shape: (4, 1)
┌─────────────────┐
│ 1 │
│ --- │
│ list[f64] │
╞═════════════════╡
│ [1.0, 4.1, 7.2] │
│ [0.4, 2.5, 4.6] │
│ [1.4, 4.5, 7.6] │
│ [0.0, 0.0, 0.0] │
└─────────────────┘
Follow on work will add support for broadcasting of literals and scalars (i.e. / 3
)
You can add pl.lit([3, 3, 3])
as a column for now.
Broadcasting of literals for the Array type was also added in 1.8.0
width = 3
dtype = pl.Array(float, width)
(df.cast(dtype)
.with_columns(
pl.sum_horizontal(pl.all().fill_null(pl.lit([0] * width, dtype)))
.alias("avg")
/
pl.lit([width] * width, dtype)
)
)
Original answer:
Perhaps you could expand on your initial attempt by replacing null
with [0, 0, 0]
and manually creating a list of [width, width, width]
to divide by.
width = pl.sum_horizontal(pl.all().is_not_null())
df.with_columns(avg =
(pl.sum_horizontal(
pl.all().fill_null([0, 0, 0]).list.explode()
) / pl.concat_list(width, width, width).list.explode())
.fill_nan(0)
.reshape((-1, 3))
)
shape: (4, 4)
┌─────────────────┬─────────────────┬─────────────────┬───────────────────────────┐
│ 1 ┆ 2 ┆ 3 ┆ avg │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╪═════════════════╪═══════════════════════════╡
│ [0.0, 1.1, 2.2] ┆ [0.3, 1.3, 2.3] ┆ [0.7, 1.7, 2.7] ┆ [0.333333, 1.366667, 2.4] │
│ [0.0, 1.1, 2.2] ┆ [0.4, 1.4, 2.4] ┆ null ┆ [0.2, 1.25, 2.3] │
│ [0.0, 1.1, 2.2] ┆ [0.5, 1.5, 2.5] ┆ [0.9, 1.9, 2.9] ┆ [0.466667, 1.5, 2.533333] │
│ null ┆ null ┆ null ┆ [0.0, 0.0, 0.0] │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────────┘