Repeat rows in a Polars DataFrame based on column value

I would like to expand the following Polars dataframe by repeating rows based on values in the quantity column.

Original DataFrame:

df = pl.from_repr("""
┌────────┬──────────┐
│ Fruit  ┆ Quantity │
│ ---    ┆ ---      │
│ str    ┆ i64      │
╞════════╪══════════╡
│ Apple  ┆ 2        │
│ Banana ┆ 3        │
└────────┴──────────┘
""")

Expected Output:

shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i32      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 1        │
│ 100 ┆ Apple  ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
└─────┴────────┴──────────┘

Here is a very similar question but using Pandas instead of Polars: Repeat rows in a pandas DataFrame based on column value

The polars repeat function does not seem to offer the same functionality as its Pandas counterpart: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.repeat.html#polars.repeat

Solution

You were close. What you were looking for was the repeat_by expression.

First some data. I'm going to add an ID column, just to show how to apply the repeat_by expression to multiple columns (but exclude Quantity).

import polars as pl

df = (
    pl.DataFrame({
        'ID' : [100, 200],
        'Fruit': ["Apple", "Banana"],
        'Quantity': [2, 3],
    })
)
df

shape: (2, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i64      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 2        │
│ 200 ┆ Banana ┆ 3        │
└─────┴────────┴──────────┘

The Algorithm

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity').explode()
    )
    .with_columns(
        pl.lit(1).alias('Quantity')
    )
)

shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i32      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 1        │
│ 100 ┆ Apple  ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
└─────┴────────┴──────────┘

How it works

The repeat_by expression will repeat a value in a Series by the value in another column/expression. In this case, we want to repeat by the value in Quantity.

We'll also use the exclude expression to apply repeat_by to all columns except Quantity (which we'll replace later).

Note that the result of repeat_by is a list.

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity')
    )
)

shape: (2, 2)
┌─────────────────┬────────────────────────────────┐
│ ID              ┆ Fruit                          │
│ ---             ┆ ---                            │
│ list[i64]       ┆ list[str]                      │
╞═════════════════╪════════════════════════════════╡
│ [100, 100]      ┆ ["Apple", "Apple"]             │
│ [200, 200, 200] ┆ ["Banana", "Banana", "Banana"] │
└─────────────────┴────────────────────────────────┘

Next, we use explode, which will take each element of each list and place it on its own row.

(
    df
    .select(
        pl.exclude('Quantity').repeat_by('Quantity').explode()
    )
)

shape: (5, 2)
┌─────┬────────┐
│ ID  ┆ Fruit  │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 100 ┆ Apple  │
│ 100 ┆ Apple  │
│ 200 ┆ Banana │
│ 200 ┆ Banana │
│ 200 ┆ Banana │
└─────┴────────┘

From there, we use the lit expression to add Quantity back to the DataFrame.