
How to turn `[1, 5]` into `[1, 2, 3, 4, 5]` in a DataFrame column of list type?


I have a polars dataframe:

pl.DataFrame({'a':[[1,3], [1,5]]})
a
list
[1, 3]
[1, 5]

and I'd like to do some kind of vectorized operation to expand this into:

a
list
[1, 2, 3]
[1, 2, 3, 4, 5]

One solution I've come up with is to split each list into two columns (init and final), combine them with pl.struct(['init', 'final']), and then use apply with the function below to build the range and keep only the valid codes.

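# valid_codes is a predefined set of allowed codes (defined elsewhere)
def get_valid_codes(struct: dict) -> list:
    # inclusive range from 'init' to 'final'
    code_range = set(range(struct['init'], struct['final'] + 1))
    codes = list(set.intersection(valid_codes, code_range))
    # fall back to [0] when no valid code lies in the range
    return codes if codes else [0]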

This is slow for my dataset (300M rows) and I'm wondering if there's a better way.

Bonus points if you can figure out how to filter out certain (predefined) values from the lists.


Solution

  • Let's expand the data so we can also demonstrate filtering out 'bad codes'.

    import polars as pl
    
    df = pl.DataFrame({"a": [[1, 3], [1, 5], [7, 9], [3, 7], [9, 13], [5, 11]]})
    print(df)
    
    shape: (6, 1)
    ┌───────────┐
    │ a         │
    │ ---       │
    │ list[i64] │
    ╞═══════════╡
    │ [1, 3]    │
    │ [1, 5]    │
    │ [7, 9]    │
    │ [3, 7]    │
    │ [9, 13]   │
    │ [5, 11]   │
    └───────────┘
    

    We'll use 6 through 10 as 'bad codes' to weed out.

    # pl.Config(fmt_table_cell_list_len=10) # increase list repr
    
    bad_codes = [6, 7, 8, 9, 10]
    
    df.with_columns(
        pl.int_ranges(pl.col("a").list.first(), pl.col("a").list.last() + 1)
          .list.set_difference(bad_codes)
          .list.sort() # set_difference does not retain order
          .alias("result")
    )
    
    shape: (6, 2)
    ┌───────────┬─────────────────┐
    │ a         ┆ result          │
    │ ---       ┆ ---             │
    │ list[i64] ┆ list[i64]       │
    ╞═══════════╪═════════════════╡
    │ [1, 3]    ┆ [1, 2, 3]       │
    │ [1, 5]    ┆ [1, 2, 3, 4, 5] │
    │ [7, 9]    ┆ []              │
    │ [3, 7]    ┆ [3, 4, 5]       │
    │ [9, 13]   ┆ [11, 12, 13]    │
    │ [5, 11]   ┆ [5, 11]         │
    └───────────┴─────────────────┘
    

    This approach leaves an empty list [] when every code in the range is a "bad code". If you need [0] instead of an empty list, you can wrap the expression in pl.when and check .list.len() to replace the empty lists with [0].