In Polars, I'm getting a result I don't expect when I slice a Series and then read its offsets.
I'm creating a Series, then slicing it:
// Make a vec of 3 items: foo, bar, baz
let string_values: Vec<&str> = vec!["foo", "bar", "baz"];
// Add it to a series, this is without dataframes
let series = Series::new("string_values", string_values);
//shape: (3,)
// Series: 'string_values' [str]
// [
// "foo"
// "bar"
// "baz"
// ]
println!("{:?}", series);
This returns a new series.
I can then use downcast_iter() to get the offsets:
// Now we should be able to downcast iter to get the offsets.
// returns [0, 3, 6, 9]
// 0-3 = foo
// 3-6 = bar
// 6-9 = baz
series.utf8().unwrap().downcast_iter().for_each(|array| {
    println!("{:?}", array.offsets());
});
Great so far.
I then slice it:
//shape: (2,)
// Series: 'string_values' [str]
// [
// "bar"
// "baz"
// ]
let series_slice = series.slice(1, 2);
println!("{:?}", series_slice);
This returns the correct values.
I then try to use downcast_iter() again:
// Now we should be able to downcast iter to get the offsets for the slice.
// This returns [3, 6, 9]
// Is "foo" still referenced?
series_slice.utf8().unwrap().downcast_iter().for_each(|array| {
    println!("{:?}", array.offsets());
});
It returns 3, 6, 9. Why is 9 returned? The length of the series is 6.
Buffers in Arrow can be shared. Besides the data, they also have an offset and a length.
Your original Arrow string array consists of the following data:
data: foobarbaz
offsets: 0, 3, 6, 9
offset: 0
length: 3
Retrieving element i uses the following algorithm in pseudocode:
let offset = array.offset
let start_index = offsets[offset + i]
let end_index = offsets[offset + i + 1]
let string_value = data[start_index..end_index]
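The pseudocode above can be turned into a small runnable sketch. The function get below is a hypothetical helper written for illustration, not part of the Polars or Arrow API, and it assumes the data buffer is plain UTF-8 text with byte offsets:

```rust
/// Hypothetical helper mirroring the pseudocode; not a Polars or Arrow API.
/// `offset` is the array-level offset, `i` is the element index within the array.
fn get<'a>(data: &'a str, offsets: &[usize], offset: usize, i: usize) -> &'a str {
    let start_index = offsets[offset + i];
    let end_index = offsets[offset + i + 1];
    &data[start_index..end_index]
}

fn main() {
    let data = "foobarbaz";
    let offsets = [0, 3, 6, 9];
    // Unsliced array: offset 0, length 3.
    for i in 0..3 {
        println!("{}", get(data, &offsets, 0, i));
    }
}
```

With an array offset of 0 this walks over foo, bar and baz in order. Passing an array offset of 1 with the same buffers makes index 0 resolve to bar.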
When you slice an array, we don't copy any data. We only update the offset and the length so that we have all the information needed to represent the sliced array:
data: foobarbaz
offsets: 0, 3, 6, 9
offset: 1
length: 2
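This is why 9 still shows up: the offsets are positions in the shared data buffer, which still holds foobarbaz, and baz ends at byte 9. The following is a minimal toy model of that behaviour. StrArray and its methods are illustrative names, not the actual Arrow types, and Rc stands in for Arrow's shared buffers:

```rust
use std::rc::Rc;

/// Toy model of a string array; illustrative only, not Arrow's real layout types.
struct StrArray {
    data: Rc<str>,        // shared value bytes, e.g. "foobarbaz"
    offsets: Rc<[usize]>, // shared offsets into `data`
    offset: usize,        // index of the first visible element
    length: usize,        // number of visible elements
}

impl StrArray {
    fn new(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut offsets = vec![0];
        for v in values {
            data.push_str(v);
            offsets.push(data.len());
        }
        StrArray { data: data.into(), offsets: offsets.into(), offset: 0, length: values.len() }
    }

    /// Slicing clones the Rc handles (no bytes are copied) and adjusts offset and length.
    fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length);
        StrArray {
            data: Rc::clone(&self.data),
            offsets: Rc::clone(&self.offsets),
            offset: self.offset + offset,
            length,
        }
    }

    fn get(&self, i: usize) -> &str {
        assert!(i < self.length);
        let start = self.offsets[self.offset + i];
        let end = self.offsets[self.offset + i + 1];
        &self.data[start..end]
    }

    /// The window of offsets that the sliced array can see.
    fn visible_offsets(&self) -> &[usize] {
        &self.offsets[self.offset..=self.offset + self.length]
    }
}

fn main() {
    let array = StrArray::new(&["foo", "bar", "baz"]);
    let sliced = array.slice(1, 2);
    println!("{:?}", sliced.visible_offsets()); // [3, 6, 9]
    println!("{} {}", sliced.get(0), sliced.get(1)); // bar baz
    println!("{}", Rc::ptr_eq(&array.data, &sliced.data)); // true: same buffer
}
```

In this model the slice sees the offsets 3, 6, 9 because they index into the original, uncopied data. The byte length of the visible strings is the last visible offset minus the first, 9 - 3 = 6.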