Search code examples
pythonpython-polars

Repeat rows in a Polars DataFrame based on column value


I would like to expand the following Polars dataframe by repeating rows based on values in the quantity column.

Original DataFrame:

df = pl.from_repr("""
┌────────┬──────────┐
│ Fruit  ┆ Quantity │
│ ---    ┆ ---      │
│ str    ┆ i64      │
╞════════╪══════════╡
│ Apple  ┆ 2        │
│ Banana ┆ 3        │
└────────┴──────────┘
""")

Expected Output:

shape: (5, 3)
┌─────┬────────┬──────────┐
│ ID  ┆ Fruit  ┆ Quantity │
│ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i32      │
╞═════╪════════╪══════════╡
│ 100 ┆ Apple  ┆ 1        │
│ 100 ┆ Apple  ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
│ 200 ┆ Banana ┆ 1        │
└─────┴────────┴──────────┘

Here is a very similar question but using Pandas instead of Polars: Repeat rows in a pandas DataFrame based on column value

The polars repeat function does not seem to offer the same functionality as its Pandas counterpart: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.repeat.html#polars.repeat


Solution

  • You were close. What you were looking for was the repeat_by expression.

    First some data. I'm going to add an ID column, just to show how to apply the repeat_by expression to multiple columns (but exclude Quantity).

    import polars as pl
    
    df = (
        pl.DataFrame({
            'ID' : [100, 200],
            'Fruit': ["Apple", "Banana"],
            'Quantity': [2, 3],
        })
    )
    df
    
    shape: (2, 3)
    ┌─────┬────────┬──────────┐
    │ ID  ┆ Fruit  ┆ Quantity │
    │ --- ┆ ---    ┆ ---      │
    │ i64 ┆ str    ┆ i64      │
    ╞═════╪════════╪══════════╡
    │ 100 ┆ Apple  ┆ 2        │
    │ 200 ┆ Banana ┆ 3        │
    └─────┴────────┴──────────┘
    

    The Algorithm

    (
        df
        .select(
            pl.exclude('Quantity').repeat_by('Quantity').explode()
        )
        .with_columns(
            pl.lit(1).alias('Quantity')
        )
    )
    
    shape: (5, 3)
    ┌─────┬────────┬──────────┐
    │ ID  ┆ Fruit  ┆ Quantity │
    │ --- ┆ ---    ┆ ---      │
    │ i64 ┆ str    ┆ i32      │
    ╞═════╪════════╪══════════╡
    │ 100 ┆ Apple  ┆ 1        │
    │ 100 ┆ Apple  ┆ 1        │
    │ 200 ┆ Banana ┆ 1        │
    │ 200 ┆ Banana ┆ 1        │
    │ 200 ┆ Banana ┆ 1        │
    └─────┴────────┴──────────┘
    

    How it works

    The repeat_by expression will repeat a value in a Series by the value in another column/expression. In this case, we want to repeat by the value in Quantity.

    We'll also use the exclude expression to apply repeat_by to all columns except Quantity (which we'll replace later).

    Note that the result of repeat_by is a list.

    (
        df
        .select(
            pl.exclude('Quantity').repeat_by('Quantity')
        )
    )
    
    shape: (2, 2)
    ┌─────────────────┬────────────────────────────────┐
    │ ID              ┆ Fruit                          │
    │ ---             ┆ ---                            │
    │ list[i64]       ┆ list[str]                      │
    ╞═════════════════╪════════════════════════════════╡
    │ [100, 100]      ┆ ["Apple", "Apple"]             │
    │ [200, 200, 200] ┆ ["Banana", "Banana", "Banana"] │
    └─────────────────┴────────────────────────────────┘
    

    Next, we use explode, which will take each element of each list and place it on its own row.

    (
        df
        .select(
            pl.exclude('Quantity').repeat_by('Quantity').explode()
        )
    )
    
    shape: (5, 2)
    ┌─────┬────────┐
    │ ID  ┆ Fruit  │
    │ --- ┆ ---    │
    │ i64 ┆ str    │
    ╞═════╪════════╡
    │ 100 ┆ Apple  │
    │ 100 ┆ Apple  │
    │ 200 ┆ Banana │
    │ 200 ┆ Banana │
    │ 200 ┆ Banana │
    └─────┴────────┘
    

    From there, we use the lit expression to add Quantity back to the DataFrame.