Converting a Utf8 Series into a Series of List<Utf8> via a custom function in Rust polars


I have a Utf8 column in my DataFrame, and from that I want to create a column of List<Utf8>.

In particular, for each row I take the text of an HTML document, use soup to parse out all the <p> elements, and store the text of each separate paragraph as a Vec<String> or Vec<&str>. I have this as a standalone function:

use soup::prelude::*;

fn parse_paragraph(s: &str) -> Vec<String> {

    let soup = Soup::new(s);

    // find_all() already yields nodes; text() returns an owned String
    soup.tag("p").find_all().map(|p| p.text()).collect()

}

In trying to adapt the few available examples of applying custom functions in Rust polars, I can't seem to get the conversion to compile.

Take this minimal example, which uses a simpler string-to-vec-of-strings function and borrows from the Iterators example in the documentation:

use polars::prelude::*;

fn vector_split(text: &str) -> Vec<&str> {

    text.split(' ').collect()
    
}

fn vector_split_series(s: &Series) -> PolarsResult<Series> {

    let output : Series = s.utf8()
        .expect("Text data")
        .into_iter()
        .map(|t| t.map(vector_split))
        .collect();

    Ok(output)
    
}

fn main() {

    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();

    df.clone().lazy()
        .select([
            col("text").apply(|s| vector_split_series(&s), GetOutput::default())
                .alias("words")
        ])
        .collect();
    
}

(Note: I know there is a built-in split function for Utf8 Series, but I needed a simpler example than parsing HTML.)

I get the following error from cargo check:

error[E0277]: a value of type `polars::prelude::Series` cannot be built from an iterator over elements of type `Option<Vec<&str>>`
    --> src/main.rs:11:27
     |
11   |       let output : Series = s.utf8()
     |  ___________________________^
12   | |         .expect("Text data")
13   | |         .into_iter()
14   | |         .map(|t| t.map(vector_split))
     | |_____________________________________^ value of type `polars::prelude::Series` cannot be built from `std::iter::Iterator<Item=Option<Vec<&str>>>`
15   |           .collect();
     |            ------- required by a bound introduced by this call
     |
     = help: the trait `FromIterator<Option<Vec<&str>>>` is not implemented for `polars::prelude::Series`
     = help: the following other types implement trait `FromIterator<A>`:
               <polars::prelude::Series as FromIterator<&'a bool>>
               <polars::prelude::Series as FromIterator<&'a f32>>
               <polars::prelude::Series as FromIterator<&'a f64>>
               <polars::prelude::Series as FromIterator<&'a i32>>
               <polars::prelude::Series as FromIterator<&'a i64>>
               <polars::prelude::Series as FromIterator<&'a str>>
               <polars::prelude::Series as FromIterator<&'a u32>>
               <polars::prelude::Series as FromIterator<&'a u64>>
             and 15 others
note: required by a bound in `std::iter::Iterator::collect`

What is the correct idiom for this kind of procedure? Is there a simpler way to apply a function?


Solution

  • For future seekers, I will explain the general solution and then the specific code to make the example work. I'll also point out some gotchas for this specific example.

    Explanation

    If you need a custom function rather than the convenient built-in Expr expressions, the core task is to write a function that converts the Series of the input column into a Series backed by a ChunkedArray of the correct output type. This function is what you pass to apply (or map) in the select statement in main. The type of that ChunkedArray is the type you declare via GetOutput.

    The code inside vector_split_series in the question works for conversion functions that output standard numeric types or Lists of numeric types. It does not work automatically for Lists of Utf8 strings, which ChunkedArrays treat specially for performance reasons. You need to build up the Series explicitly, via the correct builder type.

    In the question's case, we need to use a ListUtf8ChunkedBuilder which will create a ChunkedArray of List<Utf8>.

    So in general, the question's code works for conversion outputs that are numeric or Lists of numerics. But for lists of strings, you need to use a ListUtf8ChunkedBuilder.

    Correct code

    The correct code for the question's example looks like this:

    use polars::prelude::*;
    
    fn vector_split(text: &str) -> Vec<String> {
    
        text.split(' ').map(|x| x.to_owned()).collect()
        
    }
    
    fn vector_split_series(s: Series) -> PolarsResult<Series> {
    
        let ca = s.utf8()?;
    
        let mut builder = ListUtf8ChunkedBuilder::new("words", s.len(), ca.get_values_size());
    
        ca.into_iter()
            .for_each(|opt_s| match opt_s {
                None => builder.append_null(),
                Some(s) => {
                    builder.append_series(
                        &Series::new("words", vector_split(s).into_iter() )
                    )
                }});
    
        Ok(builder.finish().into_series())
        
    }
    
    fn main() {
    
        let df = df! [
            "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
        ].unwrap();
    
        let df2 = df.clone().lazy()
            .select([
                col("text")
                    .apply(|s| vector_split_series(s), GetOutput::from_type(DataType::List(Box::new(DataType::Utf8))))
    
                    // GetOutput::default() keeps the input dtype (Utf8) in the lazy schema,
                    // so the explicit List<Utf8> type above is the more accurate choice
                    //.apply(|s| vector_split_series(s), GetOutput::default())
                    .alias("words")
            ])
            .collect()
            .unwrap();
    
        println!("{:?}", df2);
        
    }
    

    The core is in vector_split_series. Its signature, a Series in and a PolarsResult<Series> out, is what apply expects.

    The match statement is required because Series can have null entries, and to preserve the length of the Series, you need to pass nulls through. We use the builder here so it appends the appropriate null.
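
The null pass-through can be illustrated without polars at all. The following is only a sketch in plain Rust: split_rows is a hypothetical helper, not a polars API, and Option<&str> stands in for a possibly-null Utf8 entry. It shows why each input row must produce exactly one output row, null or not.

```rust
/// Same idea as the answer's `vector_split`: returns owned Strings.
fn vector_split(text: &str) -> Vec<String> {
    text.split(' ').map(str::to_owned).collect()
}

/// Hypothetical helper (not part of polars): maps every row to an
/// optional list of words, so the output has as many rows as the input.
fn split_rows(rows: &[Option<&str>]) -> Vec<Option<Vec<String>>> {
    rows.iter()
        .map(|row| match row {
            // Null in, null out: plays the role of builder.append_null()
            None => None,
            // Value in, list out: plays the role of builder.append_series(..)
            Some(text) => Some(vector_split(text)),
        })
        .collect()
}
```

Dropping the None arm (for example with filter_map) would shorten the output, and the resulting column would no longer line up with the rest of the DataFrame.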

    For non-null entries the builder needs to append Series. Normally you can append_from_iter, but there is (as of polars 0.26.1) no implementation of FromIterator for Iterator<Item=Vec<T>>. So you need to convert the collection into an iterator on values, and that iterator into a new Series.

    Once the full ChunkedArray (a ListChunked whose inner type is Utf8) is built, you convert it into a Series with into_series() and wrap it in Ok, giving the PolarsResult<Series> that apply expects.

    Gotcha

    In the above example, vector_split can return Vec<String> or Vec<&str>. This is because split yields &str slices that borrow directly from the input text, so they remain valid for as long as the input does.

    If you are using something more complicated, like my original example of extracting text via Soup queries, and it yields an iterator of &str, those references may borrow from a temporary value, and you will then get errors about returning references to temporaries.

    This is why in the working code, I pass Vec<String> back to the builder, even though it is not strictly required.
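
A small plain-Rust sketch of this gotcha follows. The Node type and its text method are hypothetical stand-ins for a parser node whose text accessor returns a freshly allocated String; they are assumptions for illustration, not the Soup API itself.

```rust
/// Hypothetical stand-in for a parsed HTML node (not the Soup type).
struct Node {
    raw: String,
}

impl Node {
    /// Returns a new owned String each time, as many parser APIs do.
    fn text(&self) -> String {
        self.raw.trim().to_owned()
    }
}

// This version is rejected by the compiler: the &str would point into
// the temporary String created by text(), which is dropped when the
// closure returns.
//
// fn paragraph_texts_borrowed(nodes: &[Node]) -> Vec<&str> {
//     nodes.iter().map(|n| n.text().as_str()).collect()
// }

/// Returning owned Strings avoids the problem: each value is moved
/// into the Vec instead of being borrowed from a temporary.
fn paragraph_texts(nodes: &[Node]) -> Vec<String> {
    nodes.iter().map(|n| n.text()).collect()
}
```

The owned version costs one allocation per paragraph, which is usually negligible next to the parsing itself.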