I have complex computation logic that requires me to create a new DataFrame from the inputs to apply_multiple, so that I can leverage DataFrame functionality such as filter, groupby, and aggregate on that new DataFrame (inside a wider groupby/aggregate context). Here is a drastically simplified example. Note that you need the "ndarray" feature to reproduce the math logic. First, I aggregate my df by a column called Region.
use polars::prelude::*;
use polars::df;
use ndarray::prelude::*;
use ndarray::stack;

pub fn main() {
    let df = df![
        "Region" => ["EU", "EU", "EU", "EU", "US", "US", "US", "US"],
        "Month" => ["NOV", "DEC", "APR", "APR", "JUL", "JAN", "JUL", "SEP"],
        "Weight" => [1, 2, 3, 4, 5, 6, 7, 8],
        "Scalar" => [Some(0.3), None, Some(0.1), Some(0.1), Some(0.2), None, Some(0.1), Some(0.3)]
    ].unwrap();

    let df1 = df.clone().lazy()
        .groupby_stable([col("Region")])
        .agg([
            // WHAT I WANT TO DO: with .first() it just returns nulls
            region_health_index().first().alias("RegionHealth"),
            // THIS THROWS AN ERROR: with .sum() it panics
            //region_health_index().sum().alias("RegionHealth2"),
        ])
        .collect()
        .unwrap();

    dbg!(df1);
}
The df looks like this:
│ Region ┆ Month ┆ Weight ┆ Scalar │
│ ---    ┆ ---   ┆ ---    ┆ ---    │
│ str    ┆ str   ┆ i32    ┆ f64    │
╞════════╪═══════╪════════╪════════╡
│ EU     ┆ NOV   ┆ 1      ┆ 0.3    │
│ EU     ┆ DEC   ┆ 2      ┆ null   │
│ EU     ┆ APR   ┆ 3      ┆ 0.1    │
│ EU     ┆ APR   ┆ 4      ┆ 0.1    │
│ US     ┆ JUL   ┆ 5      ┆ 0.2    │
│ US     ┆ JAN   ┆ 6      ┆ null   │
│ US     ┆ JUL   ┆ 7      ┆ 0.1    │
│ US     ┆ SEP   ┆ 8      ┆ 0.3    │
└────────┴───────┴────────┴────────┘
Calculate scaled weight:
/// This function is for completeness of the example only
pub fn weight_scaled() -> Expr {
    col("Weight") * col("Scalar")
}
Here is the issue: I need to aggregate WeightScaled by Month, without the nulls, and then perform a complex mathematical computation. After that aggregation, for EU I'd like to get a series of length 2: [0.3, 0.7], that is 1 * 0.3 = 0.3 for NOV and 3 * 0.1 + 4 * 0.1 = 0.7 for APR. Similarly for US: [1.7, 2.4].
Another caveat is that in practice I will have more than one column. After that aggregation I perform a complex computation, which requires me to drop the months where WeightScaled is null.
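To make that target concrete, here is a standalone sanity check of the per-month step outside apply_multiple; the WeightScaledSumed name is my own, and it uses the same polars 0.22 API as the code above:

// Sanity check of the per-month aggregation, run on the full df:
// drop null products first so months like JAN (US) disappear entirely.
let check = df.clone().lazy()
    .with_column(weight_scaled().alias("WeightScaled"))
    .filter(col("WeightScaled").is_not_null())
    .groupby_stable([col("Region"), col("Month")])
    .agg([col("WeightScaled").sum().alias("WeightScaledSumed")])
    .collect()
    .unwrap();
// EU -> [0.3, 0.7] (NOV, APR); US -> [1.7, 2.4] (JUL, SEP)
dbg!(check);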
My attempt:
/// This function is supposed to return a SCALAR per group,
/// i.e. it would act like a lit::<f64>()
pub fn region_health_index() -> Expr {
    apply_multiple(|columns| {
        // Step 1. Here, I need to aggregate by Month
        let df = DataFrame::new(vec![
            columns[0].clone(), columns[1].clone()
        ])? // fails here
        .lazy()
        .groupby_stable([col("Month")])
        .agg([
            col("WeightScaled").sum().alias("WeightScaledSumed")
        ])
        .collect()?;
        // complex computation here, producing the scalar `res`
        Ok(Series::new("result", &[res]))
    },
    &[col("Month"), weight_scaled().alias("WeightScaled")],
    GetOutput::from_type(DataType::Float64))
}
The error, if I call region_health_index().sum(), is:
thread '<unnamed>' panicked at 'assertion failed: idx.len() <= self.len()', C:\Users\Anato\.cargo\registry\src\github.com-1ecc6299db9ec823\polars-core-0.22.3\src\frame\groupby\aggregations\mod.rs:534:17
res_m for the EU group here would be

-1.7 -1.3
-1.7 -1.3

and res is the sum of these numbers, which is -6 (these are just dummy values for reproducing the issue). Similarly for the US group. Hence, the final result would be something like:
│ Region ┆ RegionHealth │
│ ---    ┆ ---          │
│ str    ┆ f64          │
╞════════╪══════════════╡
│ EU     ┆ -6.0         │
│ US     ┆ 0.82         │
└────────┴──────────────┘
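For completeness, the dummy computation could be sketched with ndarray roughly like this, placed inside the closure after Step 1. The 2.0 offset and the two stacked rows are invented here so that the EU group sums to -6; this assumes ndarray 0.15's stack! macro:

// Dummy "complex computation" on the aggregated frame from Step 1:
// shift the per-month sums by an arbitrary constant and stack two copies.
let v: Array1<f64> = df
    .column("WeightScaledSumed")?
    .f64()?
    .into_no_null_iter()
    .collect();
let row = &v - 2.0;                     // EU: [0.3, 0.7] - 2.0 = [-1.7, -1.3]
let res_m = stack![Axis(0), row, row];  // two identical rows of dummy values
let res: f64 = res_m.sum();             // EU: -6.0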
Solved using the df! macro inside apply_multiple:
let _df = df![
    "a" => columns[0].clone(),
    "b" => columns[1].clone(),
]?;
I'm not sure why the original way of creating the DataFrame didn't work; it might be a bug.
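For reference, the full function with the workaround applied looks roughly like this. I renamed the columns to Month/WeightScaled so the rest of the code stays unchanged, and a plain sum stands in for the real computation:

pub fn region_health_index() -> Expr {
    apply_multiple(|columns| {
        // Step 1. Build the per-group DataFrame with the df! macro
        // (DataFrame::new failed here), then aggregate by Month.
        let df = df![
            "Month" => columns[0].clone(),
            "WeightScaled" => columns[1].clone(),
        ]?
        .lazy()
        .groupby_stable([col("Month")])
        .agg([
            col("WeightScaled").sum().alias("WeightScaledSumed")
        ])
        .collect()?;
        // Step 2. Complex computation goes here; a plain sum is a placeholder.
        let res: f64 = df.column("WeightScaledSumed")?.sum().unwrap_or(0.0);
        Ok(Series::new("result", &[res]))
    },
    &[col("Month"), weight_scaled().alias("WeightScaled")],
    GetOutput::from_type(DataType::Float64))
}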