
Principles of immutability and copy-on-write in the Polars Python API


Hi, I'm working on a fan-fiction project: a full feature and syntax translation of pypolars to R, called "minipolars".

I understand that the pypolars API, e.g. DataFrame, in general exhibits immutable behavior, or roughly the same as copy-on-write behavior. Most methods that alter the DataFrame object return a cheap copy. The exceptions known to me are DataFrame.extend and the @columns.setter. In R, most APIs strive for strictly immutable behavior. I imagine supporting both strictly immutable behavior and an optional pypolars-like behavior. The Rust-polars API has many mutable operations, lifetimes and whatnot, but it is understandably all about performance and expressiveness.

  • Are there many more central mutable behaviors in the pypolars API?
  • Would a pypolars API with only immutable behavior suffer in performance and expressiveness?

The API of the R library data.table does stray from immutable behavior sometimes. However, all such mutable operations are prefixed with set (e.g. setnames, setkey) or use the assignment operator :=.

  • Is there an obvious way in pypolars to recognize if an operation is mutable or not?

By mutable behavior I mean, for example, that calling .extend() on df1 after defining the variable df_copy still affects the value of df_copy.

import polars as pl

df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]})

df_copy = df1
df_copy_old_shape = df_copy.shape

df1.extend(df2)
df_copy_new_shape = df_copy.shape

# extend was an operation with mutable behaviour: df_copy was affected.
assert df_copy_old_shape != df_copy_new_shape


Solution

  • Most of the Python Polars API is actually a wrapper around the Polars lazy API.

    (df.with_columns(..)
       .join(..)
       .group_by()
       .select(..))
    

    translates to:

    (df.lazy().with_columns(..).collect(no_optimization=True)
       .lazy().join(..).collect(no_optimization=True)
       .lazy().group_by().collect(no_optimization=True)
       .lazy().select(..).collect(no_optimization=True))
    

    That means that almost all expressions run on the Polars query engine. The query engine itself determines whether an operation can be done in place or whether it should clone the data (copy-on-write).

    Polars actually has copy-on-write on steroids, as it only copies if the data is shared. If we are the only owner, we mutate in place. We can do this because Rust has a borrow checker, so if we own the data and the ref count is 1, we can mutate the data.

    I would recommend implementing your R-polars API similarly to what we do in Python. Then all operations can be made pure (i.e. return a new Series/Expr/DataFrame) and Polars will decide when to mutate in place.

    Don't worry about copying data. All data buffers are wrapped in an Arc (Rust's atomically reference-counted pointer), so a copy only increments a reference count.
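
As a rough illustration of the "copy only when shared" rule, here is a toy model in plain Python. It is not how Polars is implemented internally: SharedBuffer, Column and the explicit owners counter are hypothetical names standing in for an Arc-wrapped buffer and its reference count.

```python
class SharedBuffer:
    """Toy stand-in for an Arc-wrapped buffer: data plus an explicit owner count."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 1


class Column:
    """Hypothetical column handle that may share its buffer with other handles."""

    def __init__(self, buf):
        self._buf = buf

    @classmethod
    def from_values(cls, values):
        return cls(SharedBuffer(values))

    def clone(self):
        # Cheap: no data is copied, only the owner count is incremented.
        self._buf.owners += 1
        return Column(self._buf)

    def append(self, value):
        if self._buf.owners > 1:
            # Shared: detach onto a private copy before writing.
            self._buf.owners -= 1
            self._buf = SharedBuffer(self._buf.values)
        # Sole owner at this point: mutate in place.
        self._buf.values.append(value)

    def to_list(self):
        return list(self._buf.values)


a = Column.from_values([1, 2, 3])
buf_before = a._buf
a.append(4)                      # sole owner: same buffer, mutated in place
assert a._buf is buf_before

b = a.clone()                    # b shares the buffer with a
assert b._buf is a._buf
b.append(5)                      # shared: b copies first, a is untouched
assert a.to_list() == [1, 2, 3, 4]
assert b.to_list() == [1, 2, 3, 4, 5]
```

From the outside every operation looks pure to the caller holding a, yet the first append never copied anything. That is the property an R wrapper could rely on when exposing only non-mutating methods.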