Slicing multiple chunks in a polars dataframe


Consider the following dataframe.

import polars as pl

df = pl.DataFrame(data={"col1": range(10)})
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
│ 4    │
│ 5    │
│ 6    │
│ 7    │
│ 8    │
│ 9    │
└──────┘

Say I have a list of tuples in which the first value is the start index (offset) and the second is the length, the same pair of arguments that pl.DataFrame.slice takes. For example:

slices = [(1,2), (5,3)]

Now, what's a good way to extract both chunks from df, so that the first chunk starts at row 1 and has a length of 2, while the second starts at row 5 and has a length of 3?

Here's what I am looking for:

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Solution

  • You could use pl.DataFrame.slice to obtain each slice separately and then use pl.concat to concatenate all slices.

    pl.concat(df.slice(offset, length) for offset, length in slices)
    
    shape: (5, 1)
    ┌──────┐
    │ col1 │
    │ ---  │
    │ i64  │
    ╞══════╡
    │ 1    │
    │ 2    │
    │ 5    │
    │ 6    │
    │ 7    │
    └──────┘
    

    Edit: For a vectorized approach, you could first turn the list of slice parameters into a dataframe of row indices (using pl.int_ranges and pl.DataFrame.explode). A left join of these indices against df, keyed on df's row number, then selects the matching rows.

    indices = (
        pl.DataFrame(slices, orient="row", schema=["offset", "length"])
        .select(
            index=pl.int_ranges("offset", pl.col("offset") + pl.col("length"))
        )
        .explode("index")
    )
    
    shape: (5, 1)
    ┌───────┐
    │ index │
    │ ---   │
    │ i64   │
    ╞═══════╡
    │ 1     │
    │ 2     │
    │ 5     │
    │ 6     │
    │ 7     │
    └───────┘
    
    (
        indices
        .join(
            df,
            left_on="index",
            right_on=pl.int_range(pl.len()),
            how="left",
            coalesce=True,
        )
        .drop("index")
    )
    
    shape: (5, 1)
    ┌──────┐
    │ col1 │
    │ ---  │
    │ i64  │
    ╞══════╡
    │ 1    │
    │ 2    │
    │ 5    │
    │ 6    │
    │ 7    │
    └──────┘