Search code examples
pandasdataframerustrust-polars

Compare two Pola-rs dataframes by position


Suppose I have two dataframes like:

let df_1 = df! {
        "1" => [1,   2,   2,   3,   4,   3],
        "2" => [1,   4,   2,   3,   4,   3],
        "3" => [1,   2,   6,   3,   4,   3],
    }
    .unwrap();

    let mut df_2 = df_1.clone();
    for idx in 0..df_2.width() {
        df_2.apply_at_idx(idx, |s| {
            s.cummax(false)
                .shift(1)
                .fill_null(FillNullStrategy::Zero)
                .unwrap()
        })
        .unwrap();
    }

    println!("{:#?}", df_1);
    println!("{:#?}", df_2);

shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:

shape: (6, 3)
┌───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘

In Python Pandas I might do df.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?


Solution

  • <edit>

    If you actually have a single dataframe you can do:

    let mask = 
        when(all().gt_eq(
                all().cummax(false).shift(1).fill_null(0)))
        .then(all())
        .otherwise(lit(NULL));
    
    let out = 
        df_1.lazy().select(&[mask])
        //.sum()
        .collect();
    

    </edit>


    From https://stackoverflow.com/a/72899438

    Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations

    It appears the recommended way is to add a suffix to one of the dataframes, "concat" them and use when/then/otherwise.

    .with_context() has been added since that answer which can be used to access both dataframes.

    In Python:

    df1.lazy().with_context(
       df2.lazy().select(pl.all().suffix("_right"))
    ).select([
       pl.when(pl.col(name) >= pl.col(f"{name}_right"))
         .then(pl.col(name)) 
       for name in df1.columns
    ]).collect()
    

    I've not used rust - but my attempt at a translation:

    let mask = 
       df_1.get_column_names().iter().map(|name| 
          when(col(name).gt_eq(col(&format!("{name}_right"))))
          .then(col(name))
          .otherwise(lit(NULL))
       ).collect::<Vec<Expr>>();
    
    let out = 
        df_1.lazy()
        .with_context(&[
            df_2.lazy().select(&[all().suffix("_right")])
        ])
        .select(&mask)
        //.sum()
        .collect();
    
    println!("{:#?}", out);
    

    Output:

    Ok(shape: (6, 3)
    ┌──────┬──────┬──────┐
    │ 1    ┆ 2    ┆ 3    │
    │ ---  ┆ ---  ┆ ---  │
    │ i32  ┆ i32  ┆ i32  │
    ╞══════╪══════╪══════╡
    │ 1    ┆ 1    ┆ 1    │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 2    ┆ 4    ┆ 2    │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 2    ┆ null ┆ 6    │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 3    ┆ null ┆ null │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 4    ┆ 4    ┆ null │
    ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ null ┆ null ┆ null │
    └──────┴──────┴──────┘)