
Offsets in downcast_iter and series slicing in Polars


In Polars, I'm seeing a different result than I would expect when I slice a Series and then read its offsets.

I'm creating a Series, then slicing it:

use polars::prelude::*;

// Make a vec of 3 items: foo, bar, baz
let string_values: Vec<&str> = vec!["foo", "bar", "baz"];
// Build a Series directly from the vec; no DataFrame is involved
let series = Series::new("string_values", string_values);

//shape: (3,)
// Series: 'string_values' [str]
// [
//  "foo"
//  "bar"
//  "baz"
// ]
println!("{:?}", series);

This returns a new series.

I can then use downcast_iter() to get the offsets:

// Now we should be able to downcast iter to get the offsets.
// returns [0, 3, 6, 9]
// 0-3 = foo
// 3-6 = bar
// 6-9 = baz
series.utf8().unwrap().downcast_iter().for_each(|array| {
    println!("{:?}", array.offsets());
});

Great so far.

I then slice it:

//shape: (2,)
// Series: 'string_values' [str]
// [
//  "bar"
//  "baz"
// ]
let series_slice = series.slice(1, 2);
println!("{:?}", series_slice);

This returns the correct values.

I then try to use downcast_iter() again:

// Now we should be able to downcast iter to get the offsets for the slice.
// This returns [3, 6, 9]
// Is "foo" still referenced?
series_slice.utf8().unwrap().downcast_iter().for_each(|array| {
    println!("{:?}", array.offsets());
});

It returns 3, 6, 9. Why is 9 returned? The strings in the sliced series are only 6 bytes long in total.


Solution

  • Buffers in Arrow can be shared. Besides the data, they also have an offset and a length.

    Your original Arrow string array consists of the following data:

    data:     foobarbaz
    offsets:  0, 3, 6, 9
    offset:   0
    length:   3
    

    Retrieving element i uses the following algorithm in pseudocode:

    let offset = array.offset
    let start_index = offsets[offset + i]
    let end_index = offsets[offset + i + 1]
    
    let string_value = data[start_index..end_index]
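
The pseudocode above can be turned into a small runnable sketch. `StringArrayModel`, its fields, `get`, and `example_array` are hypothetical names used only for illustration; this is a simplified stand-in built on the standard library, not the actual Arrow or Polars type.

```rust
/// Simplified stand-in for an Arrow string array (hypothetical, for illustration only).
struct StringArrayModel {
    data: Vec<u8>,       // all string bytes, concatenated
    offsets: Vec<usize>, // string k occupies data[offsets[k]..offsets[k + 1]]
    offset: usize,       // index of the first logical element
    length: usize,       // number of logical elements
}

impl StringArrayModel {
    /// Returns logical element `i`, or `None` when `i` is out of bounds.
    fn get(&self, i: usize) -> Option<&str> {
        if i >= self.length {
            return None;
        }
        // Same lookup as the pseudocode: shift the index by the array offset.
        let start_index = self.offsets[self.offset + i];
        let end_index = self.offsets[self.offset + i + 1];
        std::str::from_utf8(&self.data[start_index..end_index]).ok()
    }
}

/// Builds the unsliced array from the example: "foo", "bar", "baz".
fn example_array() -> StringArrayModel {
    StringArrayModel {
        data: b"foobarbaz".to_vec(),
        offsets: vec![0, 3, 6, 9],
        offset: 0,
        length: 3,
    }
}

fn main() {
    let array = example_array();
    for i in 0..array.length {
        println!("{:?}", array.get(i));
    }
}
```

Note that `start_index` and `end_index` are always positions in the full `data` buffer, regardless of the value of `offset`.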
    

    When you slice an array, we don't copy any data. We only update the offset and the length such that we have all information to represent the sliced array:

    data:     foobarbaz
    offsets:  0, 3, 6, 9
    offset:   1
    length:   2
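
A minimal sketch of this zero-copy slicing, assuming the same simplified model of a string array. `StringArrayModel`, `slice`, `visible_offsets`, `values`, and `example_array` are hypothetical names, and `Arc` merely stands in for Arrow's shared buffers. In this model the sliced array still sees positions into the full `data` buffer, which would be consistent with the `[3, 6, 9]` printed in the question: 9 is where "baz" ends in `foobarbaz`, not a length.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow string array with shared buffers
/// (hypothetical, for illustration only).
struct StringArrayModel {
    data: Arc<[u8]>,       // shared string bytes
    offsets: Arc<[usize]>, // shared offsets into `data`
    offset: usize,         // index of the first logical element
    length: usize,         // number of logical elements
}

impl StringArrayModel {
    /// Slices without copying: only `offset` and `length` change.
    fn slice(&self, start: usize, len: usize) -> Self {
        assert!(start + len <= self.length, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offsets: Arc::clone(&self.offsets),
            offset: self.offset + start,
            length: len,
        }
    }

    /// The window of the offsets buffer that this array covers.
    fn visible_offsets(&self) -> &[usize] {
        &self.offsets[self.offset..=self.offset + self.length]
    }

    /// Decodes every logical element from the shared data buffer.
    fn values(&self) -> Vec<&str> {
        self.visible_offsets()
            .windows(2)
            .map(|w| std::str::from_utf8(&self.data[w[0]..w[1]]).expect("valid UTF-8"))
            .collect()
    }
}

/// Builds the unsliced array from the example: "foo", "bar", "baz".
fn example_array() -> StringArrayModel {
    StringArrayModel {
        data: Arc::from(b"foobarbaz".to_vec()),
        offsets: Arc::from(vec![0usize, 3, 6, 9]),
        offset: 0,
        length: 3,
    }
}

fn main() {
    let original = example_array();
    let sliced = original.slice(1, 2);
    println!("{:?}", sliced.visible_offsets());
    println!("{:?}", sliced.values());
    // Both arrays point at the same allocation; "foo" is still in the buffer.
    println!("{}", Arc::ptr_eq(&original.data, &sliced.data));
}
```

To get the byte length of the sliced strings in this model, subtract the first visible offset from the last one (9 - 3 = 6).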