I have a dataframe like:
import polars as pl

data = {
    "a": [[1], [2], [3, 4], [5, 6, 7]],
    "b": [[], [8], [9, 10], [11, 12]],
}
df = pl.DataFrame(data)
"""
┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1]       ┆ []        │
│ [2]       ┆ [8]       │
│ [3, 4]    ┆ [9, 10]   │
│ [5, 6, 7] ┆ [11, 12]  │
└───────────┴───────────┘
"""
Each pair of lists may not have the same length, and I want to "truncate" the explode to the shortest of both lists:
"""
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
"""
I was thinking that maybe I'd have to fill the shortest of both lists with None to match both lengths, and then drop_nulls. But I was wondering if there was a more direct approach to this?
Here's one approach:
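# Shortest list length in each row, across columns 'a' and 'b'
min_length = pl.min_horizontal(pl.col('a', 'b').list.len())

out = (
    df.filter(min_length != 0)  # drop rows where either list is empty
    .with_columns(
        pl.col('a', 'b').list.head(min_length)  # truncate both lists to the shortest
    )
    .explode('a', 'b')
)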
Output:
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
Explanation
Compute the length of each list with Expr.list.len and get the shortest for each row with pl.min_horizontal.
Drop the rows where min_length == 0 (df.filter) and inside df.with_columns select the first n values of each list with Expr.list.head.
Finally, explode both columns together with df.explode.