Aggregate points into a grid using Polars

I have a points dataset in the following format (x, y, value), is it possible to get aggregated dataset using Polars native (maybe even lazy) code as much as possible?

Basically I want to create a virtual grid and then sum all the points in respective grid cells, x_step and y_step being grid cell dimensions. The output would be a dataset with columns (cell_x, cell_y, agg_value) while cell_x and cell_y columns only taking values divisible by my predefined steps (representing bottom left coordinate of the grid cell), and agg_value being a sum of all point values inside the cell:
(cell_x >= x and x < cell_x + x_step) and (cell_y >= y and y < cell_y + y_step).

Currently I am iterating from start, incrementing my variable cell_x by x_step and the same for Y axis in a nested loop. Then I call sum on the filtered subset (cell) of points and I output one row to the output cell. It is rather slow in Python.

Here is a visual example, all points have value of 1 for simplicity:

Solution

Edit: As of Polars 0.14.1, we can use the // operator as floor division instead of using floor, so that the algorithm becomes:

step_x = 2
step_y = 2
(
    df.with_columns(
        ((pl.col("x") // step_x) * step_x).alias("cell_x"),
        ((pl.col("y") // step_y) * step_y).alias("cell_y"),
    )
    .group_by("cell_x", "cell_y")
    .agg(
        pl.col("val").sum().alias("agg_value"),
    )
    .sort("cell_x", "cell_y").collect()
)

One easy (and performant) way to solve this is to use the floor function to calculate the grid coordinates. We can easily accomplish all of this in Lazy mode, and using only Polars Expressions. (And best of all, no slow nested for loops.)

Let's start with your data. We'll put the DataFrame in Lazy mode.

import polars as pl
df = (
    pl.DataFrame(
        {
            "x": [0.5, 2, 2.5, 5.5],
            "y": [1.5, 2.5, 3.5, 3.5],
            "val": [1, 1, 1, 1],
        }
    )
).lazy()
df.collect()

shape: (4, 3)
┌─────┬─────┬─────┐
│ x   ┆ y   ┆ val │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0.5 ┆ 1.5 ┆ 1   │
│ 2.0 ┆ 2.5 ┆ 1   │
│ 2.5 ┆ 3.5 ┆ 1   │
│ 5.5 ┆ 3.5 ┆ 1   │
└─────┴─────┴─────┘

We can aggregate the values into grid cells as follows:

step_x = 2
step_y = 2
(
    df.with_columns(
        ((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
        ((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
    )
    .group_by("cell_x", "cell_y")
    .agg(
        pl.col("val").sum().alias("agg_value"),
    )
    .sort("cell_x", "cell_y")
    .collect()
)

shape: (3, 3)
┌────────┬────────┬───────────┐
│ cell_x ┆ cell_y ┆ agg_value │
│ ---    ┆ ---    ┆ ---       │
│ f64    ┆ f64    ┆ i64       │
╞════════╪════════╪═══════════╡
│ 0.0    ┆ 0.0    ┆ 1         │
│ 2.0    ┆ 2.0    ┆ 2         │
│ 4.0    ┆ 2.0    ┆ 1         │
└────────┴────────┴───────────┘

Since the calculation of the coordinates is in a with_columns context, they will run in parallel.

Other Notes

To trace how the algorithm maps each point to the lower-left coordinate of the grid that contains it, just comment out the group_by and agg methods.

step_x = 2
step_y = 2
(
    df.with_columns(
        ((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
        ((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
    )
    .sort("cell_x", "cell_y")
    .collect()
)

shape: (4, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ x   ┆ y   ┆ val ┆ cell_x ┆ cell_y │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---    │
│ f64 ┆ f64 ┆ i64 ┆ f64    ┆ f64    │
╞═════╪═════╪═════╪════════╪════════╡
│ 0.5 ┆ 1.5 ┆ 1   ┆ 0.0    ┆ 0.0    │
│ 2.0 ┆ 2.5 ┆ 1   ┆ 2.0    ┆ 2.0    │
│ 2.5 ┆ 3.5 ┆ 1   ┆ 2.0    ┆ 2.0    │
│ 5.5 ┆ 3.5 ┆ 1   ┆ 4.0    ┆ 2.0    │
└─────┴─────┴─────┴────────┴────────┘

Also, I was careful to design the algorithm to work with negative grid coordinates (if you need that). In general, you have to be careful to distinguish between "truncating" and "casting" which converts -2.5 to -2, versus "floor" which converts -2.5 to -3. (We want "floor" in this case.)

For example:

import numpy as np

rng = np.random.default_rng(1)
nbr_rows = 10
df = (
    pl.DataFrame(
        {
            "x": rng.uniform(-10, 10, nbr_rows),
            "y": rng.uniform(-10, 10, nbr_rows),
            "val": rng.integers(1, 10, nbr_rows),
        }
    )
    .with_columns(pl.col("x", "y").round(1))
    .sort("x", "y")
    .lazy()
)
df.collect()

shape: (10, 3)
┌──────┬──────┬─────┐
│ x    ┆ y    ┆ val │
│ ---  ┆ ---  ┆ --- │
│ f64  ┆ f64  ┆ i64 │
╞══════╪══════╪═════╡
│ -9.4 ┆ -4.8 ┆ 9   │
│ -7.1 ┆ -3.4 ┆ 1   │
│ -3.8 ┆ -3.9 ┆ 5   │
│ -1.8 ┆ -1.9 ┆ 9   │
│ -1.5 ┆ -0.9 ┆ 5   │
│ 0.2  ┆ 5.1  ┆ 1   │
│ 1.0  ┆ -5.9 ┆ 7   │
│ 6.6  ┆ -7.3 ┆ 2   │
│ 9.0  ┆ 0.8  ┆ 7   │
│ 9.0  ┆ 5.8  ┆ 3   │
└──────┴──────┴─────┘

The algorithm maps each point to lower-left grid coordinates as follows:

shape: (10, 5)
┌──────┬──────┬─────┬────────┬────────┐
│ x    ┆ y    ┆ val ┆ cell_x ┆ cell_y │
│ ---  ┆ ---  ┆ --- ┆ ---    ┆ ---    │
│ f64  ┆ f64  ┆ i64 ┆ f64    ┆ f64    │
╞══════╪══════╪═════╪════════╪════════╡
│ -9.4 ┆ -4.8 ┆ 9   ┆ -10.0  ┆ -6.0   │
│ -7.1 ┆ -3.4 ┆ 1   ┆ -8.0   ┆ -4.0   │
│ -3.8 ┆ -3.9 ┆ 5   ┆ -4.0   ┆ -4.0   │
│ -1.8 ┆ -1.9 ┆ 9   ┆ -2.0   ┆ -2.0   │
│ -1.5 ┆ -0.9 ┆ 5   ┆ -2.0   ┆ -2.0   │
│ 1.0  ┆ -5.9 ┆ 7   ┆ 0.0    ┆ -6.0   │
│ 0.2  ┆ 5.1  ┆ 1   ┆ 0.0    ┆ 4.0    │
│ 6.6  ┆ -7.3 ┆ 2   ┆ 6.0    ┆ -8.0   │
│ 9.0  ┆ 0.8  ┆ 7   ┆ 8.0    ┆ 0.0    │
│ 9.0  ┆ 5.8  ┆ 3   ┆ 8.0    ┆ 4.0    │
└──────┴──────┴─────┴────────┴────────┘