Tags: python, dataframe, python-polars

Split value between polars DataFrame rows


I would like to find a way to distribute the values of a DataFrame among the rows of another DataFrame using polars (without iterating through the rows).

import polars as pl

df1 = pl.DataFrame({"Name": ["A", "B", "C"], "Amount": [100, 300, 250]})
df2 = pl.DataFrame({"Name": ["A", "A", "B", "B", "B", "C"], "Item": ["x1", "x2", "y1", "y2", "y3", "z"], "Price": [40, 60, 50, 150, 200, 400]})

I have df1 with the amounts to be distributed:

shape: (3, 2)
┌──────┬────────┐
│ Name ┆ Amount │
│ ---  ┆ ---    │
│ str  ┆ i64    │
╞══════╪════════╡
│ A    ┆ 100    │
│ B    ┆ 300    │
│ C    ┆ 250    │
└──────┴────────┘

And df2 to which I want to append the distributed values (in a new column) using the common "Name" column.

shape: (6, 3)
┌──────┬──────┬───────┐
│ Name ┆ Item ┆ Price │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ str  ┆ i64   │
╞══════╪══════╪═══════╡
│ A    ┆ x1   ┆ 40    │
│ A    ┆ x2   ┆ 60    │
│ B    ┆ y1   ┆ 50    │
│ B    ┆ y2   ┆ 150   │
│ B    ┆ y3   ┆ 200   │
│ C    ┆ z    ┆ 400   │
└──────┴──────┴───────┘

The rows in the target (df2) are already sorted. Within each Name, going through the rows in order, the assigned amount should equal the row's price as long as enough of that Name's amount remains; otherwise the row gets whatever is left.

So the result in this case should look like this:

shape: (6, 5)
┌──────┬──────┬───────┬────────┬─────────────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ Assigned amount │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---             │
│ str  ┆ str  ┆ i64   ┆ i64    ┆ i64             │
╞══════╪══════╪═══════╪════════╪═════════════════╡
│ A    ┆ x1   ┆ 40    ┆ 100    ┆ 40              │
│ A    ┆ x2   ┆ 60    ┆ 100    ┆ 60              │
│ B    ┆ y1   ┆ 50    ┆ 300    ┆ 50              │
│ B    ┆ y2   ┆ 150   ┆ 300    ┆ 150             │
│ B    ┆ y3   ┆ 200   ┆ 300    ┆ 100             │
│ C    ┆ z    ┆ 400   ┆ 250    ┆ 250             │
└──────┴──────┴───────┴────────┴─────────────────┘

In this example, the amount for A covers both prices, so each row of A is assigned its full price. For the last item of B and for C, the price is higher than what is left, so the remaining amount is assigned instead.
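To pin the rule down, here is a minimal plain-Python reference that does iterate over the rows. It is only meant as a way to check the numbers a vectorized solution produces, not as the approach being asked for. The names `assign_amounts`, `amounts` and `rows` are made up for this sketch.

```python
def assign_amounts(amounts, rows):
    """Greedy reference: amounts maps Name -> Amount, rows is an ordered
    list of (Name, Price) pairs. Returns the assigned amount per row."""
    remaining = dict(amounts)  # copy so the caller's dict is not modified
    assigned = []
    for name, price in rows:
        # Give the row its price, capped by what the group still has left.
        take = min(price, remaining[name])
        assigned.append(take)
        remaining[name] -= take
    return assigned


# Same data as df1 and df2 above (hypothetical variable names).
amounts = {"A": 100, "B": 300, "C": 250}
rows = [("A", 40), ("A", 60), ("B", 50), ("B", 150), ("B", 200), ("C", 400)]

print(assign_amounts(amounts, rows))  # [40, 60, 50, 150, 100, 250]
```

Once a group's amount is used up, every later row in that group gets 0 under this rule.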

Is there an efficient way to do this?

My initial solution was to calculate the cumulative sum of Price in a new column of the target DataFrame, then left join the source DataFrame and subtract the cumulative sum. This works as long as the amount is high enough, but for the last item of B and for C I get negative values instead of the remaining amount.


Solution

  • @jqurious, good answer. This might be slightly more succinct:

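    (
        # Attach each Name's Amount to the rows of df2.
        df2.join(df1, on='Name')
        .with_columns(
            pl.min_horizontal(
                # Never assign more than the row's own price...
                pl.col('Price'),
                # ...or more than what the earlier rows of the same Name left over.
                pl.col('Amount') -
                pl.col('Price').cum_sum().shift(1, fill_value=0).over('Name')
            )
            # Rows reached after the amount is used up get 0 instead of a negative value.
            .clip(0)
            .alias('assigned')
        )
    )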
    
    shape: (6, 5)
    ┌──────┬──────┬───────┬────────┬──────────┐
    │ Name ┆ Item ┆ Price ┆ Amount ┆ assigned │
    │ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---      │
    │ str  ┆ str  ┆ i64   ┆ i64    ┆ i64      │
    ╞══════╪══════╪═══════╪════════╪══════════╡
    │ A    ┆ x1   ┆ 40    ┆ 100    ┆ 40       │
    │ A    ┆ x2   ┆ 60    ┆ 100    ┆ 60       │
    │ B    ┆ y1   ┆ 50    ┆ 300    ┆ 50       │
    │ B    ┆ y2   ┆ 150   ┆ 300    ┆ 150      │
    │ B    ┆ y3   ┆ 200   ┆ 300    ┆ 100      │
    │ C    ┆ z    ┆ 400   ┆ 250    ┆ 250      │
    └──────┴──────┴───────┴────────┴──────────┘