This is the code , here i'm calculating a column named Total_SKU_Count, i want the column value as 1 for first occurrences of unique subset else 0.
Let's consider this dummy data set,
import pandas as pd
data = {
'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
'item': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
'Sold_Qty_LY': [10, 20, 15, 25, 30, 40, 10, 15, 20],
}
df = pd.DataFrame(data)
Now for the below subset
subset = ['store', 'item', 'date']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)
This is the output
store item date Sold_Qty_LY Total_SKU_Count
0 A X 2023-01-01 10 1
1 A Y 2023-01-01 20 1
2 B X 2023-01-01 15 1
3 B Y 2023-01-01 25 1
4 C X 2023-01-01 30 1
5 C Y 2023-01-01 40 1
6 A X 2023-01-02 10 1
7 B Y 2023-01-02 15 1
8 C X 2023-01-02 20 1
For this one,
subset = ['store', 'item']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)
This is the output
store item date Sold_Qty_LY Total_SKU_Count
0 A X 2023-01-01 10 1
1 A Y 2023-01-01 20 1
2 B X 2023-01-01 15 1
3 B Y 2023-01-01 25 1
4 C X 2023-01-01 30 1
5 C Y 2023-01-01 40 1
6 A X 2023-01-02 10 0
7 B Y 2023-01-02 15 0
8 C X 2023-01-02 20 0
If you observe the later output you can clearly understand the problem.
I've found this solution on stackoverflow, Polars - drop duplicate row based on column subset but keep first, but in my case i can't drop any rows, and this is not useful for me.Since polars is new for me i'm having a hard time to solve this problem.Please help me if there is a way to achieve this.Your support is much appreciated.
All you need is
dfpl.with_columns(Total_SKU_Count=pl.struct('store', 'item').is_first_distinct())
What this does:
Struct
column from 'store'
and 'item'
is_first_distinct
on that columnIf you want your exact output, you may want to cast to pl.Int32
as well, i.e.
(
dfpl.with_columns(
Total_SKU_Count=pl.struct("store", "item").is_first_distinct().cast(pl.Int32)
)
)
shape: (9, 5)
┌───────┬──────┬────────────┬─────────────┬─────────────────┐
│ store ┆ item ┆ date ┆ Sold_Qty_LY ┆ Total_SKU_Count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i32 │
╞═══════╪══════╪════════════╪═════════════╪═════════════════╡
│ A ┆ X ┆ 2023-01-01 ┆ 10 ┆ 1 │
│ A ┆ Y ┆ 2023-01-01 ┆ 20 ┆ 1 │
│ B ┆ X ┆ 2023-01-01 ┆ 15 ┆ 1 │
│ B ┆ Y ┆ 2023-01-01 ┆ 25 ┆ 1 │
│ C ┆ X ┆ 2023-01-01 ┆ 30 ┆ 1 │
│ C ┆ Y ┆ 2023-01-01 ┆ 40 ┆ 1 │
│ A ┆ X ┆ 2023-01-02 ┆ 10 ┆ 0 │
│ B ┆ Y ┆ 2023-01-02 ┆ 15 ┆ 0 │
│ C ┆ X ┆ 2023-01-02 ┆ 20 ┆ 0 │
└───────┴──────┴────────────┴─────────────┴─────────────────┘