I have a polars dataframe:
pl.DataFrame({'a':[[1,3], [1,5]]})
a
list
[1, 3]
[1, 5]
and I'd like to do some kind of vectorized operation to expand this into:
a
list
[1, 2, 3]
[1, 2, 3, 4, 5]
A solution I've come up with is splitting the array into two columns (init and final), then doing pl.struct(['init', 'final']) followed by apply to get the range.
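# valid_codes is a predefined set of allowed codes, defined elsewhere
def get_valid_codes(struct: dict) -> list:
    # inclusive range from init to final
    code_range = set(range(struct['init'], struct['final'] + 1))
    codes = list(set.intersection(valid_codes, code_range))
    # fall back to [0] when no valid code is in the range
    return codes if codes else [0]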
This is slow for my dataset (300M rows) and I'm wondering if there's a better way.
Bonus points if you can figure out how to filter out certain (predefined) values from the lists.
Let's expand the data so we can show some logic for 'bad codes'.
import polars as pl
df = pl.DataFrame({"a": [[1, 3], [1, 5], [7, 9], [3, 7], [9, 13], [5, 11]]})
print(df)
shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 3]    │
│ [1, 5]    │
│ [7, 9]    │
│ [3, 7]    │
│ [9, 13]   │
│ [5, 11]   │
└───────────┘
We'll use 6 through 10 as 'bad codes' to weed out.
# pl.Config(fmt_table_cell_list_len=10)  # increase list repr
bad_codes = [6, 7, 8, 9, 10]
df.with_columns(
    pl.int_ranges(pl.col("a").list.first(), pl.col("a").list.last() + 1)
    .list.set_difference(bad_codes)
    .list.sort()  # set_difference does not retain order
    .alias("result")
)
shape: (6, 2)
┌───────────┬─────────────────┐
│ a         ┆ result          │
│ ---       ┆ ---             │
│ list[i64] ┆ list[i64]       │
╞═══════════╪═════════════════╡
│ [1, 3]    ┆ [1, 2, 3]       │
│ [1, 5]    ┆ [1, 2, 3, 4, 5] │
│ [7, 9]    ┆ []              │
│ [3, 7]    ┆ [3, 4, 5]       │
│ [9, 13]   ┆ [11, 12, 13]    │
│ [5, 11]   ┆ [5, 11]         │
└───────────┴─────────────────┘
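To spot-check the vectorized result on a handful of rows, the same logic can be written in plain Python. This is only a reference sketch and far too slow for a 300M-row dataset; the helper name expand_and_filter is hypothetical and not part of polars.

```python
def expand_and_filter(pair, bad_codes):
    """Expand [start, end] into an inclusive range and drop the bad codes."""
    start, end = pair
    bad = set(bad_codes)  # set lookup keeps the membership test cheap
    return [code for code in range(start, end + 1) if code not in bad]


rows = [[1, 3], [1, 5], [7, 9], [3, 7], [9, 13], [5, 11]]
bad_codes = [6, 7, 8, 9, 10]

# One filtered list per input row, in the same order as the rows
expected = [expand_and_filter(pair, bad_codes) for pair in rows]
print(expected)
```

Comparing this list against the result column on a small sample is a cheap way to confirm that the range end is inclusive and that the filtering behaves as intended.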
This algorithm leaves an empty list [] when all codes are "bad codes". If you need a [0] instead of an empty list, you can use a pl.when and the .list.len expression to change those to [0].