How to explode multiple List[_] columns with missing values in python polars?


Given a Polars dataframe like below, how can I call explode() on both columns while expanding the null entry to the correct length to match up with its row?

shape: (3, 2)
┌───────────┬─────────────────────┐
│ x         ┆ y                   │
│ ---       ┆ ---                 │
│ list[i64] ┆ list[bool]          │
╞═══════════╪═════════════════════╡
│ [1]       ┆ [true]              │
│ [1, 2]    ┆ null                │
│ [1, 2, 3] ┆ [true, false, true] │
└───────────┴─────────────────────┘

Currently, calling df.explode(["x", "y"]) raises this error:

polars.exceptions.ShapeError: exploded columns must have matching element counts

I'm assuming there's not a built-in way. But I can't find/think of a way to convert that null into a list of correct length, such that the explode will work. Here, the required length is not known statically upfront.

I looked into passing list.len() expressions into repeat_by(), but repeat_by() doesn't support null.


Solution

  • You were on the right track, trying to fill the missing values with a list of null values of correct length.

    To make pl.Expr.repeat_by work with null, the base expression must have a concrete (non-Null) dtype. This can be achieved by setting the dtype argument of pl.lit explicitly, here pl.Boolean to match the inner type of y.

    Then, the resulting column of lists of nulls can be used to fill the null values in y. From there, exploding x and y simultaneously works as usual.

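    (
        df
        .with_columns(
            # Replace each null in y with a list of typed nulls whose
            # length matches the list in x on the same row.
            pl.col("y").fill_null(
                pl.lit(None, dtype=pl.Boolean).repeat_by(pl.col("x").list.len())
            )
        )
    )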
    
    shape: (3, 2)
    ┌───────────┬─────────────────────┐
    │ x         ┆ y                   │
    │ ---       ┆ ---                 │
    │ list[i64] ┆ list[bool]          │
    ╞═══════════╪═════════════════════╡
    │ [1]       ┆ [true]              │
    │ [1, 2]    ┆ [null, null]        │
    │ [1, 2, 3] ┆ [true, false, true] │
    └───────────┴─────────────────────┘
    

    From here, calling .explode("x", "y") on the resulting frame should work as expected.

    Note. If there are more than two columns, any of which might contain null values, the approach above can be combined with this answer to obtain a valid solution.
