I have a Utf8 column in my DataFrame, and from that I want to create a column of List<Utf8>.
In particular, for each row I am taking the text of an HTML document and using soup to parse out all the <p> paragraph tags, storing the text of each separate paragraph as a Vec<String> or Vec<&str>. I have this as a standalone function:
use soup::prelude::*;

fn parse_paragraph(s: &str) -> Vec<String> {
    let soup = Soup::new(s);
    // text() returns an owned String for each matched <p> node
    soup.tag("p").find_all().map(|p| p.text()).collect()
}
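For illustration, here is roughly how I expect that function to behave on a tiny HTML snippet (the markup and expected output below are just an illustration, assuming the soup 0.5 API):

fn main() {
    let html = "<html><body><p>first paragraph</p><p>second paragraph</p></body></html>";
    let paragraphs = parse_paragraph(html);
    // each <p> element's text becomes one entry in the vector
    assert_eq!(paragraphs, vec!["first paragraph".to_string(), "second paragraph".to_string()]);
}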
In trying to adapt the few available examples of applying custom functions in Rust polars, I can't get the conversion to compile. Take this minimal example, which uses a simpler string-to-vec-of-strings function, borrowing from the Iterators example in the documentation:
use polars::prelude::*;

fn vector_split(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

fn vector_split_series(s: &Series) -> PolarsResult<Series> {
    let output: Series = s.utf8()
        .expect("Text data")
        .into_iter()
        .map(|t| t.map(vector_split))
        .collect();
    Ok(output)
}

fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();

    df.clone().lazy()
        .select([
            col("text").apply(|s| vector_split_series(&s), GetOutput::default())
                .alias("words")
        ])
        .collect();
}
(Note: I know there is an in-built split function for Utf8 Series, but I needed a simpler example than parsing HTML.)

I get the following error from cargo check:
error[E0277]: a value of type `polars::prelude::Series` cannot be built from an iterator over elements of type `Option<Vec<&str>>`
--> src/main.rs:11:27
|
11 | let output : Series = s.utf8()
| ___________________________^
12 | | .expect("Text data")
13 | | .into_iter()
14 | | .map(|t| t.map(vector_split))
| |_____________________________________^ value of type `polars::prelude::Series` cannot be built from `std::iter::Iterator<Item=Option<Vec<&str>>>`
15 | .collect();
| ------- required by a bound introduced by this call
|
= help: the trait `FromIterator<Option<Vec<&str>>>` is not implemented for `polars::prelude::Series`
= help: the following other types implement trait `FromIterator<A>`:
<polars::prelude::Series as FromIterator<&'a bool>>
<polars::prelude::Series as FromIterator<&'a f32>>
<polars::prelude::Series as FromIterator<&'a f64>>
<polars::prelude::Series as FromIterator<&'a i32>>
<polars::prelude::Series as FromIterator<&'a i64>>
<polars::prelude::Series as FromIterator<&'a str>>
<polars::prelude::Series as FromIterator<&'a u32>>
<polars::prelude::Series as FromIterator<&'a u64>>
and 15 others
note: required by a bound in `std::iter::Iterator::collect`
What is the correct idiom for this kind of procedure? Is there a simpler way to apply a function?
For future seekers, I will explain the general solution and then the specific code to make the example work. I'll also point out some gotchas for this specific example.
If you need to use a custom function instead of the convenient Expr expressions, at the core of it you'll need to make a function that converts the Series of the input column into a Series backed by a ChunkedArray of the correct output type. This function is what you give to apply in the select statement in main, and the type of that ChunkedArray is the type you provide as GetOutput.
The code inside vector_split_series in the question works for conversion functions whose outputs are standard numeric types, or Lists of numeric types. It does not work automatically for Lists of Utf8 strings, because those are treated specially in ChunkedArrays for performance reasons. In that case you need to build up the Series explicitly, via the correct type builder; for the question's example that is a ListUtf8ChunkedBuilder, which creates a ChunkedArray of List<Utf8>.
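To make the contrast concrete, here is a minimal sketch of the collect-based approach for a numeric output (the function name and the length-in-bytes transformation are illustrations of mine, not part of the question):

use polars::prelude::*;

// Hypothetical UDF mapping each string to its length in bytes.
// UInt32Chunked implements FromIterator<Option<u32>>, so a plain collect()
// is enough here and no builder is needed.
fn text_len_series(s: Series) -> PolarsResult<Series> {
    let ca: UInt32Chunked = s
        .utf8()?
        .into_iter()
        .map(|opt_text| opt_text.map(|t| t.len() as u32))
        .collect();
    Ok(ca.into_series())
}

You would hand this to apply in the same way as the List<Utf8> version below, with GetOutput::from_type(DataType::UInt32).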
The correct code for the question's example looks like this:
use polars::prelude::*;

fn vector_split(text: &str) -> Vec<String> {
    text.split(' ').map(|x| x.to_owned()).collect()
}

fn vector_split_series(s: Series) -> PolarsResult<Series> {
    let ca = s.utf8()?;
    // Pre-size the builder from the input: one list entry per row, and roughly
    // as many bytes of string data as the input column holds.
    let mut builder = ListUtf8ChunkedBuilder::new("words", s.len(), ca.get_values_size());
    ca.into_iter().for_each(|opt_s| match opt_s {
        // pass nulls through so the output keeps the same length as the input
        None => builder.append_null(),
        // each row's Vec<String> becomes its own Utf8 Series, appended as one list entry
        Some(s) => builder.append_series(&Series::new("words", vector_split(s))),
    });
    Ok(builder.finish().into_series())
}
fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();

    let df2 = df.clone().lazy()
        .select([
            col("text")
                .apply(|s| vector_split_series(s), GetOutput::from_type(DataType::List(Box::new(DataType::Utf8))))
                // Can instead use default if the compiler can determine the type
                //.apply(|s| vector_split_series(s), GetOutput::default())
                .alias("words")
        ])
        .collect()
        .unwrap();

    println!("{:?}", df2);
}
The core is in vector_split_series: its signature, taking a Series and returning a PolarsResult<Series>, is exactly the shape that apply expects.
The match statement is required because a Series can have null entries, and to preserve the length of the Series you need to pass nulls through. We use the builder here so it appends the appropriate null.
For non-null entries the builder needs to append a Series. Ideally you could collect the nested output directly, but there is (as of polars 0.26.1) no FromIterator implementation that builds a Series from an iterator of Vec<T>. So each row's Vec<String> is turned into its own small Utf8 Series (Series::new accepts a Vec<String>), which is then appended to the list builder.
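As a standalone sketch of that builder pattern (the names and values here are illustrative, not from the question):

use polars::prelude::*;

// Build a List<Utf8> Series by hand: each append_series call adds one list
// entry, and append_null keeps a missing row as null.
fn build_list_example() -> Series {
    let mut builder = ListUtf8ChunkedBuilder::new("words", 3, 16);
    builder.append_series(&Series::new("words", vec!["a".to_string(), "cat".to_string()]));
    builder.append_null();
    builder.append_series(&Series::new("words", vec!["a".to_string(), "bat".to_string()]));
    builder.finish().into_series()
}

Calling build_list_example() gives a three-row Series of dtype List(Utf8) with a null in the middle row.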
Once the full ChunkedArray (here a ListChunked whose inner type is Utf8) is built with finish(), you convert it into a Series with into_series() and wrap it in Ok, giving the PolarsResult<Series> that apply expects.
In the above example, vector_split can return Vec<String> or Vec<&str>. This is because split borrows directly from the input &str, so the returned slices live as long as the argument and can safely be handed back to the caller.
If you are using something more complicated, like my original example of extracting text via Soup queries, and it hands you &str references, those references may borrow from a temporary value created inside your function, and the borrow checker will reject returning them. This is why in the working code I pass Vec<String> back to the builder, even though it is not strictly required for the split example.