I would like to expand the following Polars dataframe by repeating rows based on values in the quantity column.
Original DataFrame:
df = pl.from_repr("""
┌────────┬──────────┐
│ Fruit ┆ Quantity │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪══════════╡
│ Apple ┆ 2 │
│ Banana ┆ 3 │
└────────┴──────────┘
""")
Expected Output:
shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID ┆ Fruit ┆ Quantity │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i32 │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple ┆ 1 │
│ 100 ┆ Apple ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
└─────┴────────┴──────────┘
Here is a very similar question but using Pandas instead of Polars: Repeat rows in a pandas DataFrame based on column value
The polars repeat function does not seem to offer the same functionality as its Pandas counterpart: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.repeat.html#polars.repeat
You were close. What you were looking for was the repeat_by
expression.
First some data. I'm going to add an ID
column, just to show how to apply the repeat_by
expression to multiple columns (but exclude Quantity
).
import polars as pl
df = (
pl.DataFrame({
'ID' : [100, 200],
'Fruit': ["Apple", "Banana"],
'Quantity': [2, 3],
})
)
df
shape: (2, 3)
┌─────┬────────┬──────────┐
│ ID ┆ Fruit ┆ Quantity │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple ┆ 2 │
│ 200 ┆ Banana ┆ 3 │
└─────┴────────┴──────────┘
(
df
.select(
pl.exclude('Quantity').repeat_by('Quantity').explode()
)
.with_columns(
pl.lit(1).alias('Quantity')
)
)
shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID ┆ Fruit ┆ Quantity │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i32 │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple ┆ 1 │
│ 100 ┆ Apple ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
│ 200 ┆ Banana ┆ 1 │
└─────┴────────┴──────────┘
The repeat_by
expression will repeat a value in a Series by the value in another column/expression. In this case, we want to repeat by the value in Quantity
.
We'll also use the exclude
expression to apply repeat_by
to all columns except Quantity
(which we'll replace later).
Note that the result of repeat_by
is a list.
(
df
.select(
pl.exclude('Quantity').repeat_by('Quantity')
)
)
shape: (2, 2)
┌─────────────────┬────────────────────────────────┐
│ ID ┆ Fruit │
│ --- ┆ --- │
│ list[i64] ┆ list[str] │
╞═════════════════╪════════════════════════════════╡
│ [100, 100] ┆ ["Apple", "Apple"] │
│ [200, 200, 200] ┆ ["Banana", "Banana", "Banana"] │
└─────────────────┴────────────────────────────────┘
Next, we use explode
, which will take each element of each list and place it on its own row.
(
df
.select(
pl.exclude('Quantity').repeat_by('Quantity').explode()
)
)
shape: (5, 2)
┌─────┬────────┐
│ ID ┆ Fruit │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪════════╡
│ 100 ┆ Apple │
│ 100 ┆ Apple │
│ 200 ┆ Banana │
│ 200 ┆ Banana │
│ 200 ┆ Banana │
└─────┴────────┘
From there, we use the lit
expression to add Quantity
back to the DataFrame.