Tags: python, python-polars

Row-wise aggregation based on non-zero columns


I am facing a scenario where I have a sparse dataset: in each row, only four columns are populated (the identifier plus the three variables of a single prefix); the rest are zeros. Each variable prefix has three variables (e.g. a_qty, a_height, a_width), and the total number of columns is completely arbitrary, as new prefixes might be added at any time.

To make an example, the dataframe could resemble something like:

import polars as pl

df = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "a_qty": [1, 0, 0],
        "a_height": [2, 0, 0],
        "a_width": [3, 0, 0],
        "b_qty": [0, 3, 0],
        "b_height": [0, 5, 0],
        "b_width": [0, 8, 0],
        "c_qty": [0, 0, 10],
        "c_height": [0, 0, 11],
        "c_width": [0, 0, 12],
    }
)
# prints
┌─────────┬───────┬──────────┬─────────┬───┬─────────┬───────┬──────────┬─────────┐
│ some_id ┆ a_qty ┆ a_height ┆ a_width ┆ … ┆ b_width ┆ c_qty ┆ c_height ┆ c_width │
│ ---     ┆ ---   ┆ ---      ┆ ---     ┆   ┆ ---     ┆ ---   ┆ ---      ┆ ---     │
│ str     ┆ i64   ┆ i64      ┆ i64     ┆   ┆ i64     ┆ i64   ┆ i64      ┆ i64     │
╞═════════╪═══════╪══════════╪═════════╪═══╪═════════╪═══════╪══════════╪═════════╡
│ x       ┆ 1     ┆ 2        ┆ 3       ┆ … ┆ 0       ┆ 0     ┆ 0        ┆ 0       │
│ y       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 8       ┆ 0     ┆ 0        ┆ 0       │
│ z       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 0       ┆ 10    ┆ 11       ┆ 12      │
└─────────┴───────┴──────────┴─────────┴───┴─────────┴───────┴──────────┴─────────┘

I am looking for a way to narrow down the dataset by reducing the number of columns and getting rid of the 0s for a given row. Each row of the target dataframe should contain:

  1. qty: the sum of the columns containing qty for the row (the same applies to width and height)
  2. is_prod_*: a dummy variable set to 1 depending on the prefix of the non-zero columns. Prefixes won't mix (only a single letter per row)
df_target = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "height": [2, 5, 11],
        "width": [3, 8, 12],
        "qty": [1, 3, 10],
        "is_prod_a": [1, 0, 0],
        "is_prod_b": [0, 1, 0],
        "is_prod_c": [0, 0, 1]
    }
)
# prints
┌─────────┬────────┬───────┬─────┬───────────┬───────────┬───────────┐
│ some_id ┆ height ┆ width ┆ qty ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
│ ---     ┆ ---    ┆ ---   ┆ --- ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ i64    ┆ i64   ┆ i64 ┆ i64       ┆ i64       ┆ i64       │
╞═════════╪════════╪═══════╪═════╪═══════════╪═══════════╪═══════════╡
│ x       ┆ 2      ┆ 3     ┆ 1   ┆ 1         ┆ 0         ┆ 0         │
│ y       ┆ 5      ┆ 8     ┆ 3   ┆ 0         ┆ 1         ┆ 0         │
│ z       ┆ 11     ┆ 12    ┆ 10  ┆ 0         ┆ 0         ┆ 1         │
└─────────┴────────┴───────┴─────┴───────────┴───────────┴───────────┘

I have tried grouping by row number and summing variables by type:

width_cols = [col for col in df_test.columns if "width" in col]
height_cols = [col for col in df_test.columns if "height" in col]
qty_cols = [col for col in df_test.columns if "qty" in col]

(
    df
    .with_row_index()
    .group_by("index")
    .agg(
        pl.col(*width_cols).sum().alias('width')
    )
)

# DuplicateError: column with name 'width' has more than one occurrence

This clearly does not work: aliasing several columns to the same name creates duplicate columns. What would be the most elegant way to achieve such an aggregation?

Edit 1

It is possible to get the row-wise column sums using pl.fold, which seems to do the trick. I am still unsure how to create the dummy column named based on a condition.

(
    df_test
    .with_columns(
        pl.fold(0, lambda acc, s: acc + s, pl.col(*width_cols)).alias("width"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*height_cols)).alias("height"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*qty_cols)).alias("qty")
    )
)

Solution

  • A sum_horizontal does work for part of the problem, but only because 0 is being used as essentially a null value. There's not really an aggregation to be done, just taking the (only) non-null value per observation.

    In fact, the input dataframe for this problem can be described as one identifier column, some_id, with the remaining columns as observables. This is exactly the shape that unpivot is designed for:

    df.unpivot(index='some_id')
    
    shape: (27, 3)
    ┌─────────┬──────────┬───────┐
    │ some_id ┆ variable ┆ value │
    │ ---     ┆ ---      ┆ ---   │
    │ str     ┆ str      ┆ i64   │
    ╞═════════╪══════════╪═══════╡
    │ x       ┆ a_qty    ┆ 1     │
    │ y       ┆ a_qty    ┆ 0     │
    │ z       ┆ a_qty    ┆ 0     │
    │ x       ┆ a_height ┆ 2     │
    │ …       ┆ …        ┆ …     │
    │ z       ┆ c_height ┆ 11    │
    │ x       ┆ c_width  ┆ 0     │
    │ y       ┆ c_width  ┆ 0     │
    │ z       ┆ c_width  ┆ 12    │
    └─────────┴──────────┴───────┘
    

    Thus, in general, this is an unpivot and pivot back problem after some transformations, mainly taking out the irrelevant rows:

    df_out = (
        df
        .unpivot(index='some_id')
        .filter(pl.col('value') > 0)
        .select(
            pl.exclude('variable'),
            pl.col('variable').str.extract(r'_(\w+)$'),
            pl.col('variable').str.extract(r'^(\w)_').alias('prefix')
        )
        .pivot(on='variable', index=['some_id', 'prefix'])
    )
    
    shape: (3, 5)
    ┌─────────┬────────┬─────┬────────┬───────┐
    │ some_id ┆ prefix ┆ qty ┆ height ┆ width │
    │ ---     ┆ ---    ┆ --- ┆ ---    ┆ ---   │
    │ str     ┆ str    ┆ i64 ┆ i64    ┆ i64   │
    ╞═════════╪════════╪═════╪════════╪═══════╡
    │ x       ┆ a      ┆ 1   ┆ 2      ┆ 3     │
    │ y       ┆ b      ┆ 3   ┆ 5      ┆ 8     │
    │ z       ┆ c      ┆ 10  ┆ 11     ┆ 12    │
    └─────────┴────────┴─────┴────────┴───────┘
    

    I think a prefix column encapsulates that data better than [1, 0, 0], [0, 1, 0] etc., but you could nonetheless do that with an extra step:

    df_out.select(
        pl.exclude('prefix'),
        **{
            f'is_prod_{l}' : pl.col('prefix').eq(l).cast(pl.Int64)
            for l in df_out.get_column('prefix')
        }
    )
    
    shape: (3, 7)
    ┌─────────┬─────┬────────┬───────┬───────────┬───────────┬───────────┐
    │ some_id ┆ qty ┆ height ┆ width ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
    │ ---     ┆ --- ┆ ---    ┆ ---   ┆ ---       ┆ ---       ┆ ---       │
    │ str     ┆ i64 ┆ i64    ┆ i64   ┆ i64       ┆ i64       ┆ i64       │
    ╞═════════╪═════╪════════╪═══════╪═══════════╪═══════════╪═══════════╡
    │ x       ┆ 1   ┆ 2      ┆ 3     ┆ 1         ┆ 0         ┆ 0         │
    │ y       ┆ 3   ┆ 5      ┆ 8     ┆ 0         ┆ 1         ┆ 0         │
    │ z       ┆ 10  ┆ 11     ┆ 12    ┆ 0         ┆ 0         ┆ 1         │
    └─────────┴─────┴────────┴───────┴───────────┴───────────┴───────────┘
    

    (EDIT: As @jqurious pointed out, this step can be greatly simplified with to_dummies('prefix'). Thanks for the catch!)

    This full approach has the benefit of allowing any kind of null / unused value other than 0, and of not setting up temporary variables such as column-name lists, or even touching df.columns (although I do use df_out in the last step; it might be easier to just keep a full list of prefixes at that point).