Search code examples
dataframepython-polars

I'm trying to convert my pandas project to polars, i'm having a little trouble in converting df.duplicated function to polars


This is the code , here i'm calculating a column named Total_SKU_Count, i want the column value as 1 for first occurrences of unique subset else 0.

Let's consider this dummy data set,

import pandas as pd
    
data = {
    'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
    'item': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Sold_Qty_LY': [10, 20, 15, 25, 30, 40, 10, 15, 20],
}

df = pd.DataFrame(data)

Now for the below subset

subset = ['store', 'item', 'date']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)

This is the output

  store item        date  Sold_Qty_LY  Total_SKU_Count
0     A    X  2023-01-01           10                1
1     A    Y  2023-01-01           20                1
2     B    X  2023-01-01           15                1
3     B    Y  2023-01-01           25                1
4     C    X  2023-01-01           30                1
5     C    Y  2023-01-01           40                1
6     A    X  2023-01-02           10                1
7     B    Y  2023-01-02           15                1
8     C    X  2023-01-02           20                1

For this one,

subset = ['store', 'item']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)

This is the output

  store item        date  Sold_Qty_LY  Total_SKU_Count
0     A    X  2023-01-01           10                1
1     A    Y  2023-01-01           20                1
2     B    X  2023-01-01           15                1
3     B    Y  2023-01-01           25                1
4     C    X  2023-01-01           30                1
5     C    Y  2023-01-01           40                1
6     A    X  2023-01-02           10                0
7     B    Y  2023-01-02           15                0
8     C    X  2023-01-02           20                0



If you observe the later output you can clearly understand the problem.

I've found this solution on stackoverflow, Polars - drop duplicate row based on column subset but keep first, but in my case i can't drop any rows, and this is not useful for me.Since polars is new for me i'm having a hard time to solve this problem.Please help me if there is a way to achieve this.Your support is much appreciated.


Solution

  • All you need is

    dfpl.with_columns(Total_SKU_Count=pl.struct('store', 'item').is_first_distinct())
    

    What this does:

    1. create a Struct column from 'store' and 'item'
    2. use is_first_distinct on that column

    If you want your exact output, you may want to cast to pl.Int32 as well, i.e.

    (
        dfpl.with_columns(
            Total_SKU_Count=pl.struct("store", "item").is_first_distinct().cast(pl.Int32)
        )
    )
    
    shape: (9, 5)
    ┌───────┬──────┬────────────┬─────────────┬─────────────────┐
    │ store ┆ item ┆ date       ┆ Sold_Qty_LY ┆ Total_SKU_Count │
    │ ---   ┆ ---  ┆ ---        ┆ ---         ┆ ---             │
    │ str   ┆ str  ┆ str        ┆ i64         ┆ i32             │
    ╞═══════╪══════╪════════════╪═════════════╪═════════════════╡
    │ A     ┆ X    ┆ 2023-01-01 ┆ 10          ┆ 1               │
    │ A     ┆ Y    ┆ 2023-01-01 ┆ 20          ┆ 1               │
    │ B     ┆ X    ┆ 2023-01-01 ┆ 15          ┆ 1               │
    │ B     ┆ Y    ┆ 2023-01-01 ┆ 25          ┆ 1               │
    │ C     ┆ X    ┆ 2023-01-01 ┆ 30          ┆ 1               │
    │ C     ┆ Y    ┆ 2023-01-01 ┆ 40          ┆ 1               │
    │ A     ┆ X    ┆ 2023-01-02 ┆ 10          ┆ 0               │
    │ B     ┆ Y    ┆ 2023-01-02 ┆ 15          ┆ 0               │
    │ C     ┆ X    ┆ 2023-01-02 ┆ 20          ┆ 0               │
    └───────┴──────┴────────────┴─────────────┴─────────────────┘