Search code examples
dataframerustrust-polars

Rust polars: Create DataFrame and groupby and Aggregate on it inside apply_multiple() (ie inside another groupby context)?


I have a complex computation logic, which requires me to create a new Dataframe from inputs to the apply_multiple in order to leverage on DataFrame functionality such as filter, Groupby, Aggregate etc for that new DF (inside wider Groupby Aggregate context). Here is a drastically simplified example. Notice, you'd need feature "ndarray" to reproduce the math logic. First, I aggregate my df by a column called Region.

use polars::prelude::*;
use polars::df;
use ndarray::prelude::*;
use ndarray::stack;

pub fn main() {
    let df = df! [
        "Region" => ["EU", "EU", "EU", "EU",                "US", "US", "US", "US"],
        "Month" => ["NOV", "DEC", "APR", "APR",             "JUL", "JAN", "JUL", "SEP"],
        "Weight" => [1, 2, 3, 4,                             5,6,7,8],
        "Scalar" => [Some(0.3), None, Some(0.1), Some(0.1),   Some(0.2), None, Some(0.1), Some(0.3)]
    ].unwrap();

    let df1 = df.clone().lazy()
        .groupby_stable([col("Region")])
        .agg( [
            // WHAT I WANT TO DO: if .first() then just returns nulls
            region_health_index().first().alias("RegionHealth"),
            // THIS THROWS AN ERROR: if .sum() then error
            //region_health_index().sum().alias("RegionHealth2"),
        ]
        )
        .collect()
        .unwrap();
    dbg!(df1);
}

The df looks like this:

Region ┆ Month ┆ Weight ┆ Scalar │
│ ---    ┆ ---   ┆ ---    ┆ ---    │
│ str    ┆ str   ┆ i32    ┆ f64    │
╞════════╪═══════╪════════╪════════╡
│ EU     ┆ NOV   ┆ 1      ┆ 0.3    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ EU     ┆ DEC   ┆ 2      ┆ null   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ EU     ┆ APR   ┆ 3      ┆ 0.1    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ EU     ┆ APR   ┆ 4      ┆ 0.1    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ US     ┆ JUL   ┆ 5      ┆ 0.2    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ US     ┆ JAN   ┆ 6      ┆ null   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ US     ┆ JUL   ┆ 7      ┆ 0.1    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ US     ┆ SEP   ┆ 8      ┆ 0.3    │

Calculate scaled weight:

/// This function is for completeness of example only
pub fn weight_scaled() -> Expr {
    col("Weight") * col("Scalar")
}

Here is the issue, I need to aggregate WeightScaled by month, without the nulls and then perform a complex mathematical computation. After aggregation for EU I'd like to get a series of len 2: [0.3; 0.7] - that is 1 * 0.3 and 0.1 * 3+0.1 * 0.4 And similarly for US: [1.7; 2.4]

Another caveat is that in practice I will have more than one column.

After that I perform complex computation, which requires me to drop those months where WeightScaled is null.

My attempt:

/// This function is supposed to return a SCALAR, per Group,
/// ie it would be lit::<f64>()
pub fn region_health_index() -> Expr {
    apply_multiple(|columns| {
        // Step 1. Here, I need to aggregate by Month
        let df = DataFrame::new(vec![
            columns[0].clone(), columns[1].clone()
        ])? // fails here
        
        .lazy()
        .groupby_stable([col("Month")])
        .agg([
            col("WeightScaled").sum().alias("WeightScaledSumed")
        ])
        .collect()?;

        // complex computation here  

        Ok( Series::new("result", &[res]) )
        
    }, 
        &[col("Month"), weight_scaled().alias("WeightScaled")], 
        GetOutput::from_type(DataType::Float64))
}

The error, if I call region_health_index().sum() is:

thread 'thread '' panicked at '' panicked at 'assertion failed: idx.len() <= self.len()assertion failed: idx.len() <= self.len()', ', C:\Users\Anato.cargo\registry\src\github.com-1ecc6299db9ec823\polars-core-0.22.3\src\frame\groupby\aggregations\mod.rsC:\Users\Anato.cargo\registry\src\github.com-1ecc6299db9ec823\polars-core-0.22.3\src\frame\groupby\aggregations\mod.rs::534534::1717

res_m for EU group here would be

-1.7    -1.3
-1.7    -1.3

and res is the sum of these numbers which is -6. these are just dummy values for reproducing the issue.Similarly, for US group. Hence, final result would be something like:

───────┬───────┬──
│ Region ┆ RegionHealth
│ ---    ┆ ---   ┆
│ str    ┆ f64   ┆ 
╞════════╪═══════╪
│ EU     ┆ -6    ┆ 
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌
│ US     ┆ 0.82 

Solution

  • Solved using the df! macro inside the apply_multiple:

    let _df = df![
                "a" => columns[0].clone(),
                "b" => columns[1].clone(),
            ]?
    

    Not sure why the original way of creation df didn't work, might be a bug.