Convert &str to f64 using a Rust Polars custom function


My problem can probably be described as being very new to both Rust and Polars. Go easy on me. :)

I'm trying to establish a pattern using custom functions, based on this documentation: https://pola-rs.github.io/polars-book/user-guide/dsl/custom_functions.html. So far I have been unsuccessful.

In my code, I have a function declared as follows:

pub fn convert_str_to_tb(value: &str) -> f64 {
    let value = value.replace(",", "");
    let mut parts = value.split_whitespace();
    let num = parts.next().unwrap().parse::<f64>().unwrap();
    let unit = parts.next().unwrap();

    match unit {
        "KB" => num / (1000.0 * 1000.0 * 1000.0),
        "MB" => num / (1000.0 * 1000.0),
        "GB" => num / 1000.0,
        "TB" => num,
        _ => panic!("Unsupported unit: {}", unit),
    }
}

I believe I should be able to call this function like so:

df.with_columns([
    col("value").map(|s| Ok(convert_str_to_tb(s))).alias("value_tb");
])

My first issue was that the with_columns method doesn't seem to exist, so I had to use with_column. If I use with_column, I receive the following error:

the trait bound `Expr: IntoSeries` is not satisfied
the following other types implement trait `IntoSeries`:
  Arc<(dyn polars::prelude::SeriesTrait + 'static)>
  ChunkedArray<T>
  Logical<DateType, Int32Type>
  Logical<DatetimeType, Int64Type>
  Logical<DurationType, Int64Type>
  Logical<TimeType, Int64Type>
  polars::prelude::Series

The DataFrame I am trying to transform:

let mut df = df!("volume" => &["volume01", "volume02", "volume03"],
                 "value" => &["1,000 GB", "2,000,000 MB", "3 TB"]).unwrap();

Perhaps there is a way to do this without a custom function?


Solution

  • Problem 1, with_columns

    One point of confusion in the documentation: the df in the example is a lazy data frame. You can see that they call .lazy() in the full code snippet where a custom function is used. .with_columns() is an available method on the lazy data frame. The eager DataFrame's with_column expects a Series rather than an expression, which is what the IntoSeries error is telling you.

  • Problem 2, custom function

    There is a mismatch between the types the custom function is expected to have and the types you have defined. Your function takes a &str and returns an f64. However, the s parameter that .map() passes to the closure is actually a Series, and the closure is expected to return a Result wrapping an Option<Series>, which is why the call further down wraps the function's result in Ok(...).

    So what's happening here? The .map() function hands your closure the whole column as a Series, and your custom function needs to iterate over its values.

    Updating your custom function to have the appropriate arg and return type:

    pub fn convert_str_to_tb(value: Series) -> Option<Series> {
        // Convert every value in the column and collect the f64 results into a new Series.
        Some(value.iter().map(|v| {
            // Each item is an AnyValue; unwrap() panics on nulls and non-string values.
            let value = v.get_str().unwrap().replace(",", "");
            let mut parts = value.split_whitespace();
            let num = parts.next().unwrap().parse::<f64>().unwrap();
            let unit = parts.next().unwrap();

            // Decimal units, normalised to terabytes.
            match unit {
                "KB" => num / (1000.0 * 1000.0 * 1000.0),
                "MB" => num / (1000.0 * 1000.0),
                "GB" => num / 1000.0,
                "TB" => num,
                _ => panic!("Unsupported unit: {}", unit),
            }
        }).collect())
    }
    

    And called using

    df.lazy().with_columns([
        col("value").map(|s| Ok(convert_str_to_tb(s)), GetOutput::default()).alias("value_tb")
    ]).collect().unwrap();
    

    Gives the output:

    shape: (3, 3)
    ┌──────────┬──────────────┬──────────┐
    │ volume   ┆ value        ┆ value_tb │
    │ ---      ┆ ---          ┆ ---      │
    │ str      ┆ str          ┆ f64      │
    ╞══════════╪══════════════╪══════════╡
    │ volume01 ┆ 1,000 GB     ┆ 1.0      │
    │ volume02 ┆ 2,000,000 MB ┆ 2.0      │
    │ volume03 ┆ 3 TB         ┆ 3.0      │
    └──────────┴──────────────┴──────────┘
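
One possible refinement: the per-value parsing in convert_str_to_tb panics on a missing number, a missing unit, or an unsupported unit. Here is a minimal sketch of a non-panicking version of just that step, using only the standard library. The name parse_tb is a hypothetical helper and not part of Polars, and returning None for bad input is an assumption about how you would want malformed rows handled.

```rust
/// Parse a size string such as "1,000 GB" into terabytes (decimal units).
/// Returns None instead of panicking when the input is malformed.
/// `parse_tb` is a hypothetical helper name, not a Polars function.
pub fn parse_tb(value: &str) -> Option<f64> {
    // Drop thousands separators so "2,000,000" parses as a number.
    let cleaned = value.replace(',', "");
    let mut parts = cleaned.split_whitespace();

    // `?` returns None early if the number or the unit is missing,
    // or if the number does not parse as an f64.
    let num = parts.next()?.parse::<f64>().ok()?;
    let unit = parts.next()?;

    match unit {
        "KB" => Some(num / (1000.0 * 1000.0 * 1000.0)),
        "MB" => Some(num / (1000.0 * 1000.0)),
        "GB" => Some(num / 1000.0),
        "TB" => Some(num),
        _ => None, // unsupported unit
    }
}
```

Inside the closure passed to .map(), each value could then be converted with something like v.get_str().and_then(parse_tb), so that malformed rows become nulls rather than aborting the whole query. This assumes an iterator of Option<f64> can be collected into a Series in your Polars version, which I have not checked against every release.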