Tags: rust, rust-polars

How to create a column with the lengths of strings from a different column in Polars Rust?


I'm trying to replicate one of the Polars Python examples in Rust but seem to have hit a wall. In the Python docs there is an example which creates a new column with the lengths of the strings from another column. So for example, column B will contain the lengths of all the strings in column A.

The example code looks like this:

import polars as pl

df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})

df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count")) 

As you can see, it uses the str namespace to access the lengths() function, but the same approach does not compile in the Rust version:

use polars::prelude::*;

// This fails to compile with the following error:
// no method named `lengths` found for struct `StringNameSpace` in the current scope

fn print_length_strings_in_column() -> () {
    let df = generate_df().expect("error");
    let new_df = df
        .lazy()
        .with_column(col("vendor_id").str().lengths().alias("vendor_id_length"))
        .collect();
}

Cargo.toml:

[dependencies]
polars = {version = "0.22.8", features = ["strings", "lazy"]}

I checked the docs and it seems like the Rust version of Polars does not implement the lengths() function. There is the str_lengths function in the Utf8NameSpace but it's not entirely clear to me how to use this.

I feel like I'm missing something very simple here, but I don't see it. How would I go about tackling this issue?

Thanks!


Solution

  • You have to use the apply function and downcast the series to a Utf8 ChunkedArray with utf8(). That array has a str_lengths() method: https://docs.rs/polars/0.22.8/polars/chunked_array/struct.ChunkedArray.html

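    use polars::prelude::*;

    let s = Series::new("vendor_id", &["Ant", "no", "how", "Ant", "mans"]);
    let df = DataFrame::new(vec![s]).unwrap();
    let res = df
        .lazy()
        .with_column(
            col("vendor_id")
                // Downcast to a Utf8 ChunkedArray, then take each string's length.
                .apply(
                    |srs| Ok(srs.utf8()?.str_lengths().into_series()),
                    // str_lengths() yields unsigned 32-bit values, so declare UInt32.
                    GetOutput::from_type(DataType::UInt32),
                )
                .alias("vendor_id_length"),
        )
        .collect();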
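
    If you want to sanity-check the values in vendor_id_length, here is a minimal plain-Rust sketch of what the closure computes per element. It does not use Polars at all; byte_lengths is a hypothetical helper name I made up for illustration, and it assumes the lengths are byte counts (for the ASCII sample data above, byte length and character count are the same anyway).

```rust
/// Hypothetical helper (not part of Polars): returns the byte length of
/// each input string as a u32, mirroring the per-element result expected
/// from the apply closure above.
fn byte_lengths(values: &[&str]) -> Vec<u32> {
    // str::len() counts bytes, not Unicode characters.
    values.iter().map(|s| s.len() as u32).collect()
}
```

    For the sample column, byte_lengths(&["Ant", "no", "how", "Ant", "mans"]) gives [3, 2, 3, 3, 4], which is what the new column should contain.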