Tags: rust, rust-polars

Avoid clone when re-using a Rust Polars DataFrame?


I am trying to implement a bias-corrected and accelerated (BCa) confidence interval in Rust. My metric function takes a Polars DataFrame, does some operations on it, and returns an f64. In the example below the .lazy() is obviously not needed, but the real function does require it (it does group_bys, etc.).

One step of a BCa confidence interval is to calculate the metric on the original sample. A second step is a jackknife: calculating the metric on the sample with the ith row deleted, for each i.

The issue is that without the .clone() in the first step, Rust complains about "borrow of moved value". If I change metric to take a reference, then I either have to clone within the function or dereference within the function, or I get "cannot move out of a shared reference". Is it possible to avoid this clone, or is the clone very cheap and I shouldn't worry about it?

use polars::prelude::*;
use rayon::iter::{IntoParallelIterator, ParallelIterator};

fn metric(df: DataFrame) -> f64 {
    df.lazy().collect().unwrap()["x"].sum().unwrap()
}

pub fn bca_confidence_interval(df: DataFrame) -> (f64, f64, f64) {
    let df_height = df.height();
    let stat_original = metric(df.clone());

    let index = ChunkedArray::new("index", 0..df_height as u64);
    let jacknife_stats: Vec<f64> = (0..df_height)
        .into_par_iter()
        .map(|i| metric(df.filter(&index.not_equal(i)).unwrap()))
        .filter(|x| !x.is_nan())
        .collect();

    (0.0, 1.0, 2.0)
}

Solution

  • The .clone() is pretty cheap. I wouldn't worry about it.

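    A DataFrame just holds a Vec<Series> for its columns, and each Series just holds an Arc<_> pointing at the column data. An Arc provides shared ownership, so cloning one only increments a reference count. Cloning the DataFrame therefore costs a small Vec allocation plus one counter increment per column, regardless of the number of rows; there is no deep copy, and the data is shared.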
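
To make the sharing concrete, here is a minimal sketch using only the standard library. ToySeries, ToyFrame, and toy_metric are hypothetical stand-ins invented for this illustration; they are not polars types and only mimic the "Vec of Arc-backed columns" layout described above. The sketch shows that a clone points at the same allocations and only bumps the reference count, and that the count drops again once a consuming function (like metric in the question) is done with its copy.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for illustration only; these are NOT polars types.
#[derive(Clone)]
struct ToySeries {
    values: Arc<Vec<f64>>, // column data behind a reference-counted pointer
}

#[derive(Clone)]
struct ToyFrame {
    columns: Vec<ToySeries>,
}

impl ToyFrame {
    fn new(columns: Vec<Vec<f64>>) -> Self {
        ToyFrame {
            columns: columns
                .into_iter()
                .map(|values| ToySeries { values: Arc::new(values) })
                .collect(),
        }
    }

    // True if every column points at the same allocation as in `other`.
    fn shares_data_with(&self, other: &ToyFrame) -> bool {
        self.columns.len() == other.columns.len()
            && self
                .columns
                .iter()
                .zip(&other.columns)
                .all(|(a, b)| Arc::ptr_eq(&a.values, &b.values))
    }

    // Number of live handles to the given column's data.
    fn ref_count(&self, column: usize) -> usize {
        Arc::strong_count(&self.columns[column].values)
    }
}

// Takes ownership of its argument, like `metric(df: DataFrame)` in the question.
fn toy_metric(frame: ToyFrame) -> f64 {
    frame.columns[0].values.iter().sum()
}

fn main() {
    let frame = ToyFrame::new(vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]]);

    // The clone copies the Vec of handles, not the column data.
    let copy = frame.clone();
    println!("shares data: {}", copy.shares_data_with(&frame));
    println!("ref count while clone is alive: {}", frame.ref_count(0));

    // The consuming call drops its copy when it returns.
    let total = toy_metric(copy);
    println!("sum: {total}");
    println!("ref count afterwards: {}", frame.ref_count(0));
}
```

The same reasoning suggests that taking a reference and cloning inside the function would cost about the same as cloning at the call site, so either signature should be fine from a performance standpoint.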