Search code examples
dataframepython-polars

Replacing multiple rows with polars , based on filter condition /equivalent to df.loc of pandas


consider this dummy dataset,

import numpy as np
import pandas as pd


np.random.seed(25)
num_rows = 6
data = {
    'item_id': np.random.choice(['A', 'B'], num_rows),
    'store_id': np.random.choice([1, 2], num_rows),
    'sold_quantity': np.random.randint(0, 5, num_rows),  
    'total_sku_count': np.random.choice([0, 1], num_rows),
    'netsales': np.random.choice([50,100], num_rows)
    
}
df = pd.DataFrame(data)

Main dataframe,


item_id store_id    sold_quantity   total_sku_count netsales
0   A   2   4   0   100
1   A   2   4   1   50
2   A   1   1   1   100
3   B   1   4   0   50
4   B   2   1   1   100
5   A   1   3   1   100

Now i'm creating a filtered df, with rows containing only "item_id"= "A" using ".iloc".Then, i'll change 'netsales' value to 120 for all rows,

sub_df = df.loc[df["item_id"]=="A"] 
sub_df['netsales'] = 120

This is the filtered df with changed value,

    item_id store_id    sold_quantity   total_sku_count netsales
0   A   2   4   0   120
1   A   2   4   1   120
2   A   1   1   1   120
5   A   1   3   1   120

Now i can replace the rows of main df ,with filtered df using below line of code on same location.

df.loc[df["item_id"]=="A"] = sub_df

This is the final df,


item_id store_id    sold_quantity   total_sku_count netsales
0   A   2   4   0   120
1   A   2   4   1   120
2   A   1   1   1   120
3   B   1   4   0   50
4   B   2   1   1   100
5   A   1   3   1   120

I actually wanted to perform the same operation in Polars. I've attempted to implement this using the 'filter' method of Polars, but I was unsuccessful due to my inexperience with the polars.Please help me if there is a way to achieve this.Your support is much appreciated.


Solution

  • There are a couple ways that you can accomplish your task. IIUC there might be two questions here:

    1. How do you perform the specific action you want.
    2. How do we seemingly update a DataFrame

    1. Replace values in a single column based on condition

    import numpy as np
    import polars as pl
    
    np.random.seed(25)
    num_rows = 6
    data = {
        'item_id': np.random.choice(['A', 'B'], num_rows),
        'store_id': np.random.choice([1, 2], num_rows),
        'sold_quantity': np.random.randint(0, 5, num_rows),
        'total_sku_count': np.random.choice([0, 1], num_rows),
        'netsales': np.random.choice([50,100], num_rows)
    
    }
    pl_df = pl.DataFrame(data).lazy() # We'll use a LazyFrame
    
    pl_df = pl_df.with_columns(
        netsales=pl.when(pl.col('item_id') == 'A').then(120).otherwise(pl.col('netsales'))
    )
    
    print(pl_df.collect())
    # shape: (6, 5)
    # ┌─────────┬──────────┬───────────────┬─────────────────┬──────────┐
    # │ item_id ┆ store_id ┆ sold_quantity ┆ total_sku_count ┆ netsales │
    # │ ---     ┆ ---      ┆ ---           ┆ ---             ┆ ---      │
    # │ str     ┆ i64      ┆ i64           ┆ i64             ┆ i64      │
    # ╞═════════╪══════════╪═══════════════╪═════════════════╪══════════╡
    # │ A       ┆ 2        ┆ 4             ┆ 0               ┆ 120      │
    # │ A       ┆ 2        ┆ 4             ┆ 1               ┆ 120      │
    # │ A       ┆ 1        ┆ 1             ┆ 1               ┆ 120      │
    # │ B       ┆ 1        ┆ 4             ┆ 0               ┆ 50       │
    # │ B       ┆ 2        ┆ 1             ┆ 1               ┆ 100      │
    # │ A       ┆ 1        ┆ 3             ┆ 1               ┆ 120      │
    # └─────────┴──────────┴───────────────┴─────────────────┴──────────┘
    

    The above complets your operation, however you might also be interested in changing many columns above in accordance with a subframe.

    2. "Replacing" a subframe: filter, update

    You can partition your DataFrame based on your computed mask, where the in-group is df.filter(mask) and the remaining group would be df.filter(~mask). This way you can make changes to the in-group and the concatenate the two parts back together. We’ll also need to track the row numbers to restore the original ordering of our rows.

    import numpy as np
    import polars as pl
    
    np.random.seed(25)
    num_rows = 6
    data = {
        'item_id': np.random.choice(['A', 'B'], num_rows),
        'store_id': np.random.choice([1, 2], num_rows),
        'sold_quantity': np.random.randint(0, 5, num_rows),
        'total_sku_count': np.random.choice([0, 1], num_rows),
        'netsales': np.random.choice([50,100], num_rows)
    
    }
    pl_df = pl.DataFrame(data).lazy().with_row_count()
    
    mask = pl.col('item_id') == 'A'
    sub_df = (
        pl_df.filter(mask)
        .with_columns(netsales=pl.lit(120))
        .cast(pl_df.schema)
    )
    remaining_df = pl_df.filter(~mask)
    
    # Reconstitute original DataFrame with new chunk
    pl_df = pl.concat([remaining_df, sub_df]).sort(pl.col('row_nr'))
    
    print(
        pl_df.collect(),
    )
    # shape: (6, 6)
    # ┌────────┬─────────┬──────────┬───────────────┬─────────────────┬──────────┐
    # │ row_nr ┆ item_id ┆ store_id ┆ sold_quantity ┆ total_sku_count ┆ netsales │
    # │ ---    ┆ ---     ┆ ---      ┆ ---           ┆ ---             ┆ ---      │
    # │ u32    ┆ str     ┆ i64      ┆ i64           ┆ i64             ┆ i64      │
    # ╞════════╪═════════╪══════════╪═══════════════╪═════════════════╪══════════╡
    # │ 0      ┆ A       ┆ 2        ┆ 4             ┆ 0               ┆ 120      │
    # │ 1      ┆ A       ┆ 2        ┆ 4             ┆ 1               ┆ 120      │
    # │ 2      ┆ A       ┆ 1        ┆ 1             ┆ 1               ┆ 120      │
    # │ 3      ┆ B       ┆ 1        ┆ 4             ┆ 0               ┆ 50       │
    # │ 4      ┆ B       ┆ 2        ┆ 1             ┆ 1               ┆ 100      │
    # │ 5      ┆ A       ┆ 1        ┆ 3             ┆ 1               ┆ 120      │
    # └────────┴─────────┴──────────┴───────────────┴─────────────────┴──────────┘
    

    3. "Replacing" a subframe: filter, update

    Polars also has a DataFrame.update method which is used to mimic an in-place update. This takes a similar approach as we did above, except it utilizes a .join(…)….coalesce(…) under the hood instead of concat.

    import numpy as np
    import polars as pl
    
    np.random.seed(25)
    num_rows = 6
    data = {
        'item_id': np.random.choice(['A', 'B'], num_rows),
        'store_id': np.random.choice([1, 2], num_rows),
        'sold_quantity': np.random.randint(0, 5, num_rows),
        'total_sku_count': np.random.choice([0, 1], num_rows),
        'netsales': np.random.choice([50,100], num_rows)
    
    }
    pl_df = pl.DataFrame(data).lazy().with_row_count()
    
    sub_df = (
        pl_df.filter(pl.col('item_id') == 'A')
        .with_columns(netsales=pl.lit(120))
    )
    
    print(
        pl_df.update(sub_df, on='row_nr').collect()
    )
    # shape: (6, 6)
    # ┌────────┬─────────┬──────────┬───────────────┬─────────────────┬──────────┐
    # │ row_nr ┆ item_id ┆ store_id ┆ sold_quantity ┆ total_sku_count ┆ netsales │
    # │ ---    ┆ ---     ┆ ---      ┆ ---           ┆ ---             ┆ ---      │
    # │ u32    ┆ str     ┆ i64      ┆ i64           ┆ i64             ┆ i64      │
    # ╞════════╪═════════╪══════════╪═══════════════╪═════════════════╪══════════╡
    # │ 0      ┆ A       ┆ 2        ┆ 4             ┆ 0               ┆ 120      │
    # │ 1      ┆ A       ┆ 2        ┆ 4             ┆ 1               ┆ 120      │
    # │ 2      ┆ A       ┆ 1        ┆ 1             ┆ 1               ┆ 120      │
    # │ 3      ┆ B       ┆ 1        ┆ 4             ┆ 0               ┆ 50       │
    # │ 4      ┆ B       ┆ 2        ┆ 1             ┆ 1               ┆ 100      │
    # │ 5      ┆ A       ┆ 1        ┆ 3             ┆ 1               ┆ 120      │
    # └────────┴─────────┴──────────┴───────────────┴─────────────────┴──────────┘