Tags: string, rust, rust-polars

LazyFrame: How to do string manipulation on values in a single column


I want to change all string values in one column of a LazyFrame.

For example: "alles ok" ==> "ALLES OK"

I see that a Series has a function for this:

polars.internals.series.StringNameSpace.to_uppercase

Q: What is the proper way to apply a string (or Date) manipulation on just one column in a LazyFrame?

Q: Do I need to extract the column I want to work on as a Series and then re-integrate it?

I can do math on the elements of a column and put the result in a new column, e.g.:

df.with_column((col("b") ** 2).alias("b_squared")).collect() 

But how do I do the same for strings?


Solution

  • After some digging, I was able to take a string column of a LazyFrame and convert it to a datetime dtype.

    I also found a code snippet that applies a "len" function to the first column and stores the result in a new column:

    use polars::prelude::*;
    
    fn main() {
        let df: Result<DataFrame> = df!("column_1" => &["Tuesday"],
                                    "column_2" => &["1900-01-02"]);
    
        let options = StrpTimeOptions {
            date_dtype: DataType::Datetime(TimeUnit::Milliseconds, None),
            fmt: Some("%Y-%m-%d".into()),
            strict: false,
            exact: true,
        };
    
        // in-place convert string into dtype(datetime)
        let days = df
            .unwrap()
            .lazy()
            .with_column(col("column_2").str().strptime(options));
    
        // ### courtesy of Alex Moore-Niemi:
        let o = GetOutput::from_type(DataType::UInt32);
        fn str_to_len(str_val: Series) -> Result<Series> {
            let x = str_val
                // return an error instead of panicking if the column is not Utf8
                .utf8()?
                .into_iter()
                // your actual custom function would be in this map
                .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
                .collect::<UInt32Chunked>();
            Ok(x.into_series())
        }
        // ###
    
        // add a new column holding the byte length (str::len) of each string in column_1
        let days = days
            .with_column(col("column_1").alias("new_column").apply(str_to_len, o))
            .collect()
            .unwrap();
    
        let o = GetOutput::from_type(DataType::Utf8);
        fn str_to_uppercase(str_val: Series) -> Result<Series> {
            let x = str_val
                // return an error instead of panicking if the column is not Utf8
                .utf8()?
                .into_iter()
                // your actual custom function would be in this map
                .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.to_uppercase()))
                .collect::<Utf8Chunked>();
            Ok(x.into_series())
        }
    
        // column_1 to UPPERCASE ... in-place
        let days = days
            .lazy()
            .with_column(col("column_1").apply(str_to_uppercase, o))
            .collect()
            .unwrap();
    
        println!("{}", days);
    }
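
The per-element logic inside the two `map` closures above can be tried out without polars. The following is a minimal sketch in plain Rust, assuming a `Vec<Option<&str>>` as a stand-in for the string column (with `None` playing the role of a null). The helper names `to_upper`, `len_bytes` and `len_chars` are made up for illustration and are not part of polars. It also shows that `str::len` counts bytes rather than characters, which matters for non-ASCII values.

```rust
/// Uppercase a single value; a null (None) stays null.
fn to_upper(value: Option<&str>) -> Option<String> {
    value.map(|s| s.to_uppercase())
}

/// Length in bytes, which is what `name.len()` computes in the snippet above.
fn len_bytes(value: Option<&str>) -> Option<u32> {
    value.map(|s| s.len() as u32)
}

/// Length in Unicode scalar values (chars), often what is actually wanted.
fn len_chars(value: Option<&str>) -> Option<u32> {
    value.map(|s| s.chars().count() as u32)
}

fn main() {
    // Stand-in for a string column with one null entry.
    let column: Vec<Option<&str>> = vec![Some("alles ok"), None, Some("Müller")];

    // Apply each helper element-wise, as the closures do inside `apply`.
    let upper: Vec<Option<String>> = column.iter().copied().map(to_upper).collect();
    let bytes: Vec<Option<u32>> = column.iter().copied().map(len_bytes).collect();
    let chars: Vec<Option<u32>> = column.iter().copied().map(len_chars).collect();

    println!("{:?}", upper);
    println!("{:?}", bytes);
    println!("{:?}", chars);
}
```

If the character count is what you need, the closure passed to `map` in `str_to_len` could use `name.chars().count()` instead of `name.len()`.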