I have a polars List[f64] column "a". I want to create a new List[f64] column "b", which is a sequence from the min to the max of that row's list in column a, in intervals of 0.5, inclusive. So for a row with a column "a" list of [0.0, 3.0, 2.0, 6.0, 2.0], the value in column b should be [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0].
This is my solution, but it has an error.
df = df.with_columns(
    pl.col("a").list.eval(
        pl.arange(pl.element().min(), pl.element().max(), 1)
        .append(pl.arange(pl.element().min(), pl.element().max(), 1) + 0.5)
        .append(pl.element().max())
        .append(pl.element().max() - 0.5)
        .unique()
        .sort(),
        parallel=True,
    )
    .alias("b")
)
It fails the edge case where column a contains only one unique value in its list. Since polars seems to only have an integer arange() function, I build a second list and add 0.5 to it. If there is only one unique value, both ranges are empty and the output ends up with 2 values: the actual value seen, and the actual value seen - 0.5.
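To make the failure concrete, here is a plain-Python trace of the same logic. It is only a sketch: `buggy_sequence` is a hypothetical helper that mimics the expression above with ordinary lists, not polars code, and it assumes the list values are whole numbers as in the toy data.

```python
def buggy_sequence(values):
    """Mimic the append/unique/sort approach with plain Python lists."""
    lo, hi = min(values), max(values)
    # Integer range from min (inclusive) to max (exclusive).
    whole = [float(i) for i in range(int(lo), int(hi))]
    # Same range shifted by 0.5.
    halves = [x + 0.5 for x in whole]
    # Append max and max - 0.5, then deduplicate and sort.
    return sorted(set(whole + halves + [hi, hi - 0.5]))


print(buggy_sequence([0.0, 3.0, 2.0, 6.0, 2.0]))  # 0.0 to 6.0 in 0.5 steps
print(buggy_sequence([3.0, 3.0]))  # [2.5, 3.0], but [3.0] is wanted
```

With a single unique value both ranges are empty, so only the two appended values survive.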
Here is some toy data. Column "a" contains the lists whose minimums and maximums define the boundaries of the sequence. Column "b" holds the expected output.
pl.DataFrame([
    pl.Series('a', [[4.0, 5.0, 3.0, 7.0, 0.0, 1.0, 6.0, 2.0], [2.0, 4.0, 3.0, 0.0, 1.0], [1.0, 2.0, 3.0, 0.0, 4.0], [1.0, 3.0, 2.0, 0.0], [1.0, 0.0]], dtype=pl.List(pl.Float64)),
    pl.Series('b', [[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0], [0.0, 0.5, 1.0]], dtype=pl.List(pl.Float64)),
])
Speed is pretty important here; that is why I am rewriting this in Polars. Thanks.
It's relatively simple: create an inclusive integer range from 2*min to 2*max and divide it by 2:
df.with_columns(b = pl.col.a.list.eval(
    pl.arange(2*pl.element().min(), 2*pl.element().max() + 1) / 2
))
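The arithmetic behind this can be checked in plain Python. The sketch below is not polars code: `half_step_sequence` is a hypothetical helper that applies the same doubling trick to an ordinary list, and it assumes the min and max are multiples of 0.5 so that doubling them gives exact integers.

```python
def half_step_sequence(values):
    """Inclusive sequence from min to max in steps of 0.5."""
    lo, hi = min(values), max(values)
    # Doubling turns 0.5 steps into integer steps; the +1 makes the
    # upper bound inclusive. Halving maps back to the original scale.
    return [i / 2 for i in range(int(2 * lo), int(2 * hi) + 1)]


print(half_step_sequence([0.0, 3.0, 2.0, 6.0, 2.0]))  # 0.0 to 6.0 in 0.5 steps
print(half_step_sequence([3.0, 3.0]))  # [3.0]
```

When min equals max, the doubled range still contains exactly one integer, so the single-value edge case from the question yields one element instead of two.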