python, python-polars

Add new column with multiple literal values to polars dataframe


Consider the following toy example:

import polars as pl

pl.Config(tbl_rows=-1)

df = pl.DataFrame({"group": ["A", "A", "A", "B", "B"], "value": [1, 2, 3, 4, 5]})

print(df)

shape: (5, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 1     │
│ A     ┆ 2     │
│ A     ┆ 3     │
│ B     ┆ 4     │
│ B     ┆ 5     │
└───────┴───────┘

I also have a list of indicator values, such as vals = [10, 20, 30].

I am looking for an efficient way to insert each of these values into a new column called indicator using pl.lit(), expanding the dataframe vertically so that all existing rows are repeated once for every element in vals.

My current solution is to add the new column to df once per value, collect the resulting dataframes in a list, and then call pl.concat on that list.

vals = [10, 20, 30]

print(pl.concat([df.with_columns(indicator=pl.lit(val)) for val in vals]))

shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ i32       │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 3     ┆ 10        │
│ B     ┆ 4     ┆ 10        │
│ B     ┆ 5     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 2     ┆ 20        │
│ A     ┆ 3     ┆ 20        │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 5     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 30        │
│ A     ┆ 3     ┆ 30        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘

Since df could have many rows and columns, I am wondering whether my solution is efficient in terms of both speed and memory allocation.

Also, for my own understanding: when I append a new pl.DataFrame to the list, does that dataframe use additional memory, or are only new pointers created that reference the chunks in memory holding the data of the original df?


Solution

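  • You could assign the list as a column and then call .explode(). Note that the copies of each original row end up next to each other, so the row order differs from the pl.concat result.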

    df.with_columns(indicator=vals).explode("indicator")
    
    shape: (15, 3)
    ┌───────┬───────┬───────────┐
    │ group ┆ value ┆ indicator │
    │ ---   ┆ ---   ┆ ---       │
    │ str   ┆ i64   ┆ i64       │
    ╞═══════╪═══════╪═══════════╡
    │ A     ┆ 1     ┆ 10        │
    │ A     ┆ 1     ┆ 20        │
    │ A     ┆ 1     ┆ 30        │
    │ A     ┆ 2     ┆ 10        │
    │ A     ┆ 2     ┆ 20        │
    │ …     ┆ …     ┆ …         │
    │ B     ┆ 4     ┆ 20        │
    │ B     ┆ 4     ┆ 30        │
    │ B     ┆ 5     ┆ 10        │
    │ B     ┆ 5     ┆ 20        │
    │ B     ┆ 5     ┆ 30        │
    └───────┴───────┴───────────┘
    

    To specify a dtype, wrap the list in pl.lit() and pass a List dtype:

    (df.with_columns(indicator=pl.lit(vals, dtype=pl.List(pl.UInt8)))
       .explode("indicator")
    )
    
    shape: (15, 3)
    ┌───────┬───────┬───────────┐
    │ group ┆ value ┆ indicator │
    │ ---   ┆ ---   ┆ ---       │
    │ str   ┆ i64   ┆ u8        │
    ╞═══════╪═══════╪═══════════╡
    │ A     ┆ 1     ┆ 10        │
    │ A     ┆ 1     ┆ 20        │
    │ A     ┆ 1     ┆ 30        │
    │ A     ┆ 2     ┆ 10        │
    │ A     ┆ 2     ┆ 20        │
    │ …     ┆ …     ┆ …         │
    │ B     ┆ 4     ┆ 20        │
    │ B     ┆ 4     ┆ 30        │
    │ B     ┆ 5     ┆ 10        │
    │ B     ┆ 5     ┆ 20        │
    │ B     ┆ 5     ┆ 30        │
    └───────┴───────┴───────────┘