I have a column containing lists of different lengths, like the one below, and I want to apply np.diff to each list independently, in parallel.
import polars as pl
import numpy as np
np.random.seed(0)
ragged_arrays = [np.random.randint(10, size=np.random.choice(range(10))) for _ in range(5)]
df = pl.DataFrame({'values':ragged_arrays})
df
shape: (5, 1)
┌──────────────────────────┐
│ values                   │
│ ---                      │
│ list[i64]                │
╞══════════════════════════╡
│ [0, 3, 3, 7, 9]          │
│ [5, 2, 4]                │
│ [6, 8, 8, 1, 6, 7, 7]    │
│ [1, 5, 9, 8, 9, 4, 3, 0] │
│ [5, 0, 2]                │
└──────────────────────────┘
I have tried to simply apply np.diff like this:
df.select(
    np.diff(pl.col("values"))
)
But it gives me this error:
ValueError: diff requires input that is at least one dimensional
It looks like this type of vectorisation is not supported at the moment, but is there any workaround to achieve the same thing with polars? I want to avoid having to group arrays by length before running this.
All of the list methods are available in the list namespace.
In this case, Polars has its own .list.diff():
import numpy as np
import polars as pl

np.random.seed(0)
ragged_arrays = [pl.Series(np.random.randint(10, size=np.random.choice(range(10)))) for _ in range(5)]
(pl.DataFrame({
    "values": ragged_arrays
}).with_columns(
    # element-wise difference within each list, computed natively by Polars
    pl.col("values").list.diff().alias("values_diff")
))
This yields:
shape: (5, 2)
┌──────────────────────────┬─────────────────────────────────┐
│ values                   ┆ values_diff                     │
│ ---                      ┆ ---                             │
│ list[i64]                ┆ list[i64]                       │
╞══════════════════════════╪═════════════════════════════════╡
│ [0, 3, 3, 7, 9]          ┆ [null, 3, 0, 4, 2]              │
│ [5, 2, 4]                ┆ [null, -3, 2]                   │
│ [6, 8, 8, 1, 6, 7, 7]    ┆ [null, 2, 0, -7, 5, 1, 0]       │
│ [1, 5, 9, 8, 9, 4, 3, 0] ┆ [null, 4, 4, -1, 1, -5, -1, -3] │
│ [5, 0, 2]                ┆ [null, -5, 2]                   │
└──────────────────────────┴─────────────────────────────────┘