I am facing a scenario where I have a sparse dataset (for each row, only 4 columns are populated at the same time; the rest are zeros). For each variable prefix there are 3 variables (e.g. `a_qty`, `a_height`, `a_width`), and the number of columns is completely arbitrary, as new prefixes might be added at any time.
To make an example, the dataframe could resemble something like:
import polars as pl

df = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "a_qty": [1, 0, 0],
        "a_height": [2, 0, 0],
        "a_width": [3, 0, 0],
        "b_qty": [0, 3, 0],
        "b_height": [0, 5, 0],
        "b_width": [0, 8, 0],
        "c_qty": [0, 0, 10],
        "c_height": [0, 0, 11],
        "c_width": [0, 0, 12],
    }
)
# prints
┌─────────┬───────┬──────────┬─────────┬───┬─────────┬───────┬──────────┬─────────┐
│ some_id ┆ a_qty ┆ a_height ┆ a_width ┆ … ┆ b_width ┆ c_qty ┆ c_height ┆ c_width │
│ ---     ┆ ---   ┆ ---      ┆ ---     ┆   ┆ ---     ┆ ---   ┆ ---      ┆ ---     │
│ str     ┆ i64   ┆ i64      ┆ i64     ┆   ┆ i64     ┆ i64   ┆ i64      ┆ i64     │
╞═════════╪═══════╪══════════╪═════════╪═══╪═════════╪═══════╪══════════╪═════════╡
│ x       ┆ 1     ┆ 2        ┆ 3       ┆ … ┆ 0       ┆ 0     ┆ 0        ┆ 0       │
│ y       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 8       ┆ 0     ┆ 0        ┆ 0       │
│ z       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 0       ┆ 10    ┆ 11       ┆ 12      │
└─────────┴───────┴──────────┴─────────┴───┴─────────┴───────┴──────────┴─────────┘
I am looking for a way to narrow down the dataset by reducing the number of columns and getting rid of the 0s for a given row. Each row of the target dataframe should contain:

- `qty`: the sum of the columns containing `qty` for a given row (the same applies to `width` and `height`)
- `is_prod_*`: a dummy variable set to 1 depending on the prefix of the non-zero columns. Prefixes won't mix (only a single letter per row)

df_target = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "height": [2, 5, 11],
        "width": [3, 8, 12],
        "qty": [1, 3, 10],
        "is_prod_a": [1, 0, 0],
        "is_prod_b": [0, 1, 0],
        "is_prod_c": [0, 0, 1],
    }
)
# prints
┌─────────┬────────┬───────┬─────┬───────────┬───────────┬───────────┐
│ some_id ┆ height ┆ width ┆ qty ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
│ ---     ┆ ---    ┆ ---   ┆ --- ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ i64    ┆ i64   ┆ i64 ┆ i64       ┆ i64       ┆ i64       │
╞═════════╪════════╪═══════╪═════╪═══════════╪═══════════╪═══════════╡
│ x       ┆ 2      ┆ 3     ┆ 1   ┆ 1         ┆ 0         ┆ 0         │
│ y       ┆ 5      ┆ 8     ┆ 3   ┆ 0         ┆ 1         ┆ 0         │
│ z       ┆ 11     ┆ 12    ┆ 10  ┆ 0         ┆ 0         ┆ 1         │
└─────────┴────────┴───────┴─────┴───────────┴───────────┴───────────┘
I have tried grouping by row number and summing variables by type:

width_cols = [col for col in df.columns if "width" in col]
height_cols = [col for col in df.columns if "height" in col]
qty_cols = [col for col in df.columns if "qty" in col]
(
    df
    .with_row_index()
    .group_by("index")
    .agg(
        pl.col(*width_cols).sum().alias('width')
    )
)
# DuplicateError: column with name 'width' has more than one occurrence
This is clearly not working, as every summed column receives the same `width` alias, creating duplicate column names. What would be the most elegant way to achieve such an aggregation?
It is possible to get the row-wise column sums using `pl.fold`, which seems to do the trick. I'm still unsure how to create the dummy columns, whose names depend on a condition.
(
    df
    .with_columns(
        pl.fold(0, lambda acc, s: acc + s, pl.col(*width_cols)).alias("width"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*height_cols)).alias("height"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*qty_cols)).alias("qty"),
    )
)
A `sum_horizontal` does work for part of the problem, but only because `0` is being used as essentially a null value. There's not really an aggregation to be done, just taking the (only) non-null value per observation.
In fact, the input dataframe for this problem can be defined as one identifier column, `some_id`, with the remaining columns being observables. This perfectly fits what `unpivot` helps do:
df.unpivot(index='some_id')
shape: (27, 3)
┌─────────┬──────────┬───────┐
│ some_id ┆ variable ┆ value │
│ ---     ┆ ---      ┆ ---   │
│ str     ┆ str      ┆ i64   │
╞═════════╪══════════╪═══════╡
│ x       ┆ a_qty    ┆ 1     │
│ y       ┆ a_qty    ┆ 0     │
│ z       ┆ a_qty    ┆ 0     │
│ x       ┆ a_height ┆ 2     │
│ …       ┆ …        ┆ …     │
│ z       ┆ c_height ┆ 11    │
│ x       ┆ c_width  ┆ 0     │
│ y       ┆ c_width  ┆ 0     │
│ z       ┆ c_width  ┆ 12    │
└─────────┴──────────┴───────┘
Thus, in general, this is an `unpivot` and `pivot` back problem, after some transformations, mainly taking out the irrelevant rows:
df_out = (
    df
    .unpivot(index='some_id')
    .filter(pl.col('value') > 0)
    .select(
        pl.exclude('variable'),
        pl.col('variable').str.extract(r'_(\w+)$'),
        pl.col('variable').str.extract(r'^(\w)_').alias('prefix'),
    )
    .pivot(on='variable', index=['some_id', 'prefix'])
)
shape: (3, 5)
┌─────────┬────────┬─────┬────────┬───────┐
│ some_id ┆ prefix ┆ qty ┆ height ┆ width │
│ ---     ┆ ---    ┆ --- ┆ ---    ┆ ---   │
│ str     ┆ str    ┆ i64 ┆ i64    ┆ i64   │
╞═════════╪════════╪═════╪════════╪═══════╡
│ x       ┆ a      ┆ 1   ┆ 2      ┆ 3     │
│ y       ┆ b      ┆ 3   ┆ 5      ┆ 8     │
│ z       ┆ c      ┆ 10  ┆ 11     ┆ 12    │
└─────────┴────────┴─────┴────────┴───────┘
I think a `prefix` column encapsulates that data better than `[1, 0, 0]`, `[0, 1, 0]`, etc., but you could nonetheless do that with an extra step:
df_out.select(
    pl.exclude('prefix'),
    **{
        f'is_prod_{l}': pl.col('prefix').eq(l).cast(pl.Int64)
        for l in df_out.get_column('prefix')
    }
)
shape: (3, 7)
┌─────────┬─────┬────────┬───────┬───────────┬───────────┬───────────┐
│ some_id ┆ qty ┆ height ┆ width ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
│ ---     ┆ --- ┆ ---    ┆ ---   ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ i64 ┆ i64    ┆ i64   ┆ i64       ┆ i64       ┆ i64       │
╞═════════╪═════╪════════╪═══════╪═══════════╪═══════════╪═══════════╡
│ x       ┆ 1   ┆ 2      ┆ 3     ┆ 1         ┆ 0         ┆ 0         │
│ y       ┆ 3   ┆ 5      ┆ 8     ┆ 0         ┆ 1         ┆ 0         │
│ z       ┆ 10  ┆ 11     ┆ 12    ┆ 0         ┆ 0         ┆ 1         │
└─────────┴─────┴────────┴───────┴───────────┴───────────┴───────────┘
(EDIT: As @jqurious pointed out, this step can be greatly simplified with `to_dummies('prefix')`. Thanks for the catch!)
This full approach has the benefit of allowing for any type of null / unused value other than `0`, and of not setting up any temporary variables like column names, or even using `df.columns` (although I do use `df_out` in the last step; it might be easier to just have a full list of prefixes at that point).