rust rust-polars

fill_null in LazyFrame equivalent to strategy on DataFrames


I've been trying to use LazyFrames instead of DataFrames more often for performance reasons. Unfortunately, not every feature available on DataFrame is available on LazyFrame. One example is the .fill_null method: on DataFrame it takes a FillNullStrategy, but on LazyFrame it only takes a generic E where E: Into<Expr>.

Today I tried extensively, to no avail, to replicate the behavior of a FillNullStrategy on a LazyFrame with something like this:

lf.fill_null(
    when(col("*").is_null())
        .then(col("*").shift(Some(1)))
        .otherwise(col(name)),
)

That didn't work when .collect()ing the LazyFrame, though. I've noticed that Polars has such a feature in Python (docs), but not in Rust. As I assume the Polars team wouldn't implement such functionality simply by .collect()ing the LazyFrame and then .lazy()ing it back, I believe I am missing something simpler here.

Does anybody have an insight on this?


Solution

  • This took a little digging, but it looks like the Python version also exposes explicit methods for some fill strategies, and these are exposed in the Rust API as well. Here is the relevant source, including its documentation: https://github.com/pola-rs/polars/blob/275178c25b4bebf2f2c8a88993d445b5aabc8cc9/polars/polars-lazy/polars-plan/src/dsl/mod.rs#L782

    Here's an example of the 'backward' strategy:

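    use polars::prelude::*; // requires the "lazy" feature

    let df = DataFrame::new(vec![
        Series::new("data", vec![Some(1.0), None, Some(3.0), Some(4.0)])
    ])
    .unwrap();

    // Fill each null with the next non-null value (at most 1 consecutive null).
    let result = df.lazy().fill_null(col("*").backward_fill(Some(1))).collect();

    println!("{:?}", result);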
    

    and the result:

    Ok(shape: (4, 1)
    ┌──────┐
    │ data │
    │ ---  │
    │ f64  │
    ╞══════╡
    │ 1.0  │
    │ 3.0  │
    │ 3.0  │
    │ 4.0  │
    └──────┘)
    

    There is also a forward_fill method available. To fill with a literal value, you can simply pass lit(value) to fill_null. Similarly, if you want to fill with the max, min, etc., you can pass col("*").max() (or .min(), etc.).

    If you want to call fill_null directly with a FillNullStrategy, then as jqurious mentioned, that form is available on Series (and, as the question notes, on DataFrame) rather than on LazyFrame or Expr. But it looks like you can accomplish most, if not all, of the strategies using the approaches above.
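To make the backward and forward strategies from the answer concrete, here is a minimal plain-Rust sketch of what filling with a limit does to a single column. It uses only the standard library and is not Polars' implementation. The function names mirror the Polars methods for readability, but their signatures, and the reading of the limit as "maximum number of consecutive nulls to fill", are assumptions made for this illustration.

```rust
/// Fill each null with the nearest non-null value that follows it.
/// `limit` caps how many consecutive nulls are filled (None = no cap).
/// Hypothetical helper for illustration; not part of Polars.
fn backward_fill(data: &[Option<f64>], limit: Option<usize>) -> Vec<Option<f64>> {
    let mut out = data.to_vec();
    let mut next: Option<f64> = None; // nearest non-null value to the right
    let mut run = 0usize; // length of the current run of nulls
    for i in (0..data.len()).rev() {
        match data[i] {
            Some(v) => {
                next = Some(v);
                run = 0;
            }
            None => {
                run += 1;
                if limit.map_or(true, |l| run <= l) {
                    out[i] = next; // stays None if no value follows
                }
            }
        }
    }
    out
}

/// Fill each null with the nearest non-null value that precedes it.
/// Same limit semantics as `backward_fill`, scanning left to right.
fn forward_fill(data: &[Option<f64>], limit: Option<usize>) -> Vec<Option<f64>> {
    let mut out = data.to_vec();
    let mut prev: Option<f64> = None; // nearest non-null value to the left
    let mut run = 0usize;
    for i in 0..data.len() {
        match data[i] {
            Some(v) => {
                prev = Some(v);
                run = 0;
            }
            None => {
                run += 1;
                if limit.map_or(true, |l| run <= l) {
                    out[i] = prev; // stays None if no value precedes
                }
            }
        }
    }
    out
}

fn main() {
    // Same column as in the answer's example.
    let data = vec![Some(1.0), None, Some(3.0), Some(4.0)];
    println!("{:?}", backward_fill(&data, Some(1)));
    println!("{:?}", forward_fill(&data, Some(1)));
}
```

On the answer's column, the backward variant fills the single null with 3.0, matching the output shown above, while the forward variant would fill it with 1.0. Nulls at the edge of the column that have no neighbor in the fill direction stay null.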