I have a points dataset in the following format (x, y, value)
, is it possible to get aggregated dataset using Polars native (maybe even lazy) code as much as possible?
Basically I want to create a virtual grid and then sum all the points in respective grid cells, x_step
and y_step
being grid cell dimensions. The output would be a dataset with columns (cell_x, cell_y, agg_value)
while cell_x
and cell_y
columns only taking values divisible by my predefined steps (representing bottom left coordinate of the grid cell), and agg_value
being a sum of all point values inside the cell:
(cell_x >= x and x < cell_x + x_step) and (cell_y >= y and y < cell_y + y_step)
.
Currently I am iterating from start, incrementing my variable cell_x
by x_step
and the same for Y axis in a nested loop. Then I call sum on the filtered subset (cell) of points and I output one row to the output cell. It is rather slow in Python.
Here is a visual example, all points have value of 1 for simplicity:
Edit: As of Polars 0.14.1, we can use the //
operator as floor division instead of using floor
, so that the algorithm becomes:
step_x = 2
step_y = 2
(
df.with_columns(
((pl.col("x") // step_x) * step_x).alias("cell_x"),
((pl.col("y") // step_y) * step_y).alias("cell_y"),
)
.group_by("cell_x", "cell_y")
.agg(
pl.col("val").sum().alias("agg_value"),
)
.sort("cell_x", "cell_y").collect()
)
One easy (and performant) way to solve this is to use the floor
function to calculate the grid coordinates. We can easily accomplish all of this in Lazy mode, and using only Polars Expressions. (And best of all, no slow nested for loops.)
Let's start with your data. We'll put the DataFrame in Lazy mode.
import polars as pl
df = (
pl.DataFrame(
{
"x": [0.5, 2, 2.5, 5.5],
"y": [1.5, 2.5, 3.5, 3.5],
"val": [1, 1, 1, 1],
}
)
).lazy()
df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ x ┆ y ┆ val │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0.5 ┆ 1.5 ┆ 1 │
│ 2.0 ┆ 2.5 ┆ 1 │
│ 2.5 ┆ 3.5 ┆ 1 │
│ 5.5 ┆ 3.5 ┆ 1 │
└─────┴─────┴─────┘
We can aggregate the values into grid cells as follows:
step_x = 2
step_y = 2
(
df.with_columns(
((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
)
.group_by("cell_x", "cell_y")
.agg(
pl.col("val").sum().alias("agg_value"),
)
.sort("cell_x", "cell_y")
.collect()
)
shape: (3, 3)
┌────────┬────────┬───────────┐
│ cell_x ┆ cell_y ┆ agg_value │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞════════╪════════╪═══════════╡
│ 0.0 ┆ 0.0 ┆ 1 │
│ 2.0 ┆ 2.0 ┆ 2 │
│ 4.0 ┆ 2.0 ┆ 1 │
└────────┴────────┴───────────┘
Since the calculation of the coordinates is in a with_columns
context, they will run in parallel.
To trace how the algorithm maps each point to the lower-left coordinate of the grid that contains it, just comment out the group_by
and agg
methods.
step_x = 2
step_y = 2
(
df.with_columns(
((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
)
.sort("cell_x", "cell_y")
.collect()
)
shape: (4, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ x ┆ y ┆ val ┆ cell_x ┆ cell_y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╪════════╪════════╡
│ 0.5 ┆ 1.5 ┆ 1 ┆ 0.0 ┆ 0.0 │
│ 2.0 ┆ 2.5 ┆ 1 ┆ 2.0 ┆ 2.0 │
│ 2.5 ┆ 3.5 ┆ 1 ┆ 2.0 ┆ 2.0 │
│ 5.5 ┆ 3.5 ┆ 1 ┆ 4.0 ┆ 2.0 │
└─────┴─────┴─────┴────────┴────────┘
Also, I was careful to design the algorithm to work with negative grid coordinates (if you need that). In general, you have to be careful to distinguish between "truncating" and "casting" which converts -2.5 to -2, versus "floor" which converts -2.5 to -3. (We want "floor" in this case.)
For example:
import numpy as np
rng = np.random.default_rng(1)
nbr_rows = 10
df = (
pl.DataFrame(
{
"x": rng.uniform(-10, 10, nbr_rows),
"y": rng.uniform(-10, 10, nbr_rows),
"val": rng.integers(1, 10, nbr_rows),
}
)
.with_columns(pl.col("x", "y").round(1))
.sort("x", "y")
.lazy()
)
df.collect()
shape: (10, 3)
┌──────┬──────┬─────┐
│ x ┆ y ┆ val │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞══════╪══════╪═════╡
│ -9.4 ┆ -4.8 ┆ 9 │
│ -7.1 ┆ -3.4 ┆ 1 │
│ -3.8 ┆ -3.9 ┆ 5 │
│ -1.8 ┆ -1.9 ┆ 9 │
│ -1.5 ┆ -0.9 ┆ 5 │
│ 0.2 ┆ 5.1 ┆ 1 │
│ 1.0 ┆ -5.9 ┆ 7 │
│ 6.6 ┆ -7.3 ┆ 2 │
│ 9.0 ┆ 0.8 ┆ 7 │
│ 9.0 ┆ 5.8 ┆ 3 │
└──────┴──────┴─────┘
The algorithm maps each point to lower-left grid coordinates as follows:
shape: (10, 5)
┌──────┬──────┬─────┬────────┬────────┐
│ x ┆ y ┆ val ┆ cell_x ┆ cell_y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞══════╪══════╪═════╪════════╪════════╡
│ -9.4 ┆ -4.8 ┆ 9 ┆ -10.0 ┆ -6.0 │
│ -7.1 ┆ -3.4 ┆ 1 ┆ -8.0 ┆ -4.0 │
│ -3.8 ┆ -3.9 ┆ 5 ┆ -4.0 ┆ -4.0 │
│ -1.8 ┆ -1.9 ┆ 9 ┆ -2.0 ┆ -2.0 │
│ -1.5 ┆ -0.9 ┆ 5 ┆ -2.0 ┆ -2.0 │
│ 1.0 ┆ -5.9 ┆ 7 ┆ 0.0 ┆ -6.0 │
│ 0.2 ┆ 5.1 ┆ 1 ┆ 0.0 ┆ 4.0 │
│ 6.6 ┆ -7.3 ┆ 2 ┆ 6.0 ┆ -8.0 │
│ 9.0 ┆ 0.8 ┆ 7 ┆ 8.0 ┆ 0.0 │
│ 9.0 ┆ 5.8 ┆ 3 ┆ 8.0 ┆ 4.0 │
└──────┴──────┴─────┴────────┴────────┘