Hi, I'm working on a fan-fiction project: a full feature and syntax translation of pypolars to R, called "minipolars".
I understand that the pypolars API, e.g. DataFrame, in general exhibits immutable behavior, or roughly the same as copy-on-write behavior. Most methods that alter a DataFrame object return a cheap copy. The exceptions known to me are DataFrame.extend and the @columns.setter. In R, most APIs strive for strictly immutable behavior. I imagine supporting both strictly immutable behavior and optional pypolars-like behavior. The Rust polars API has many mutable operations, lifetimes and whatnot, but it is understandably all about performance and expressiveness.
The API of the R library data.table does stray from immutable behavior sometimes. However, all such mutable operations are prefixed set_ or use the set-operator :=.
By mutable behavior I mean, for example, calling the method .extend() on a DataFrame after a second variable (df_copy in the code that follows) has been bound to it, and having that call still affect the value seen through df_copy.
import polars as pl

df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]})

# df_copy is a second name for the same DataFrame object, not a copy
df_copy = df1
df_copy_old_shape = df_copy.shape

# extend() appends the rows of df2 to df1 in place
df1.extend(df2)
df_copy_new_shape = df_copy.shape

# extend was an operation with mutable behaviour: df_copy was affected
assert df_copy_old_shape != df_copy_new_shape
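To make the contrast between the two behaviours concrete without needing polars installed, here is a minimal sketch that uses plain Python lists as stand-ins for DataFrames (that substitution is an assumption for illustration only). The in-place `extend` is visible through every alias, while the pure `+` builds a new object and leaves the inputs alone.

```python
# Plain lists stand in for DataFrames here (illustrative assumption only).
rows1 = [1, 2, 3]
rows2 = [10, 20, 30]

# Mutable behaviour: alias is a second name for the same object,
# so an in-place extend is visible through it.
alias = rows1
len_before = len(alias)
rows1.extend(rows2)
len_after = len(alias)

# Immutable-style behaviour: the + operator builds a new list,
# so rows1 (and therefore alias) is left untouched.
snapshot = list(rows1)
combined = rows1 + rows2
unchanged = rows1 == snapshot

print(len_before, len_after, unchanged)
```

A strictly immutable R API would expose only operations of the second kind, and the optional pypolars-like mode would add the first kind.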
Most of the Python polars API is actually a wrapper around polars lazy.
(df.with_columns(..)
   .join(..)
   .group_by()
   .select(..))
translates to:
(df.lazy().with_columns(..).collect(no_optimization=True)
   .lazy().join(..).collect(no_optimization=True)
   .lazy().group_by().collect(no_optimization=True)
   .lazy().select(..).collect(no_optimization=True))
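The wrapper pattern above can be sketched with a toy model. Everything here (`ToyLazy`, `ToyEager`, `with_column`, the row-of-dicts representation) is hypothetical and only mimics the shape of the idea; it is not how polars is implemented. Each eager method converts to the lazy form, records one step, and collects immediately, returning a new object each time.

```python
class ToyLazy:
    """Hypothetical lazy frame: records steps and runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def with_column(self, name, fn):
        # Record a step that adds one computed column to every row.
        step = lambda rows: [{**r, name: fn(r)} for r in rows]
        return ToyLazy(self._rows, self._steps + (step,))

    def select(self, *names):
        # Record a step that keeps only the named columns.
        step = lambda rows: [{n: r[n] for n in names} for r in rows]
        return ToyLazy(self._rows, self._steps + (step,))

    def collect(self):
        # Run the recorded steps in order and return an eager result.
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return ToyEager(rows)


class ToyEager:
    """Hypothetical eager frame: every method is lazy() + one step + collect()."""

    def __init__(self, rows):
        self.rows = rows

    def lazy(self):
        return ToyLazy(self.rows)

    def with_column(self, name, fn):
        return self.lazy().with_column(name, fn).collect()

    def select(self, *names):
        return self.lazy().select(*names).collect()


df = ToyEager([{"foo": 1, "bar": 4}, {"foo": 2, "bar": 5}])
out = df.with_column("baz", lambda r: r["foo"] + r["bar"]).select("foo", "baz")
print(out.rows)
```

Because every eager call returns a fresh `ToyEager`, the original `df` is never modified, which is the pure behaviour an R wrapper could expose.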
That means that almost all expressions run on the polars query engine. The query engine itself determines whether an operation can be done in place or whether it should clone the data (copy-on-write).
Polars actually has copy-on-write on steroids, as it only copies if the data is shared. If we are the only owner, we mutate in place. We can do this because Rust has a borrow checker, so if we own the data and the ref count is 1, we can mutate the data.
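The ownership rule described here can be sketched in a few lines. This is a toy model under stated assumptions: `SharedBuffer`, `ToySeries` and the hand-maintained `owners` counter are hypothetical stand-ins for a reference-counted buffer, not polars internals. Python has no borrow checker, so the counter is updated by hand purely to show the decision logic.

```python
class SharedBuffer:
    """Toy stand-in for a reference-counted data buffer (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 1  # maintained by hand in this sketch


class ToySeries:
    def __init__(self, buffer):
        self._buf = buffer

    @classmethod
    def from_values(cls, values):
        return cls(SharedBuffer(values))

    def clone(self):
        # Cheap clone: no data is copied, only the owner count goes up.
        self._buf.owners += 1
        return ToySeries(self._buf)

    def append(self, value):
        if self._buf.owners > 1:
            # Buffer is shared: detach from it and copy before writing.
            self._buf.owners -= 1
            self._buf = SharedBuffer(self._buf.values)
        # Sole owner at this point: mutate in place.
        self._buf.values.append(value)
        return self

    def to_list(self):
        return list(self._buf.values)


a = ToySeries.from_values([1, 2, 3])
b = a.clone()
shared_after_clone = a._buf is b._buf

a.append(4)  # buffer was shared, so a copies first and b is untouched

b_buf_before = b._buf
b.append(5)  # b is now the sole owner, so this writes in place
b_wrote_in_place = b._buf is b_buf_before

print(a.to_list(), b.to_list(), shared_after_clone, b_wrote_in_place)
```

The clone is cheap, the first write through a shared buffer pays for one copy, and later writes by a sole owner pay nothing extra.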
I would recommend implementing your R-polars API similarly to what we do in Python. Then all operations can be made pure (e.g. return a new Series/Expr/DataFrame) and Polars will decide when to mutate in place.
Don't worry about copying data. All data buffers are wrapped in an Arc, so we only increment a reference count.