Consider the following toy example:
import polars as pl
pl.Config(tbl_rows=-1)
df = pl.DataFrame({"group": ["A", "A", "A", "B", "B"], "value": [1, 2, 3, 4, 5]})
print(df)
shape: (5, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 1     │
│ A     ┆ 2     │
│ A     ┆ 3     │
│ B     ┆ 4     │
│ B     ┆ 5     │
└───────┴───────┘
I also have a list of indicator values, such as vals = [10, 20, 30].
I am looking for an efficient way to insert each of these values into a new column called indicator using pl.lit(), while expanding the dataframe vertically so that all existing rows are repeated once for every element in vals.
My current solution is to add the new column to df once per value, collect the resulting dataframes in a list, and then combine them with pl.concat.
lit_vals = [10, 20, 30]
print(pl.concat([df.with_columns(indicator=pl.lit(val)) for val in lit_vals]))
shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ i32       │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 3     ┆ 10        │
│ B     ┆ 4     ┆ 10        │
│ B     ┆ 5     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 2     ┆ 20        │
│ A     ┆ 3     ┆ 20        │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 5     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 30        │
│ A     ┆ 3     ┆ 30        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘
As df could potentially have quite a lot of rows and columns, I am wondering whether my solution is efficient in terms of both speed and memory allocation.
Just for my understanding: if I append a new pl.DataFrame to the list, will this dataframe use additional memory, or will only new pointers be created that link to the chunks in memory holding the data of the original df?
You could assign the list as a column and then .explode() it:
df.with_columns(indicator=vals).explode("indicator")
shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ i64       │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 2     ┆ 20        │
│ …     ┆ …     ┆ …         │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 10        │
│ B     ┆ 5     ┆ 20        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘
To specify a dtype, you can wrap the list in pl.lit() and pass a List dtype:
(df.with_columns(indicator=pl.lit(vals, dtype=pl.List(pl.UInt8)))
.explode("indicator")
)
shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ u8        │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 2     ┆ 20        │
│ …     ┆ …     ┆ …         │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 10        │
│ B     ┆ 5     ┆ 20        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘