I have a Polars Dataframe that contains 3D spatial coordinates (x, y, z). The coordinates may be defined in either rectangular, cylindrical or spherical systems. The coordinate systems may be local (translated/rotated with respect to global) The Dataframe contains the coordinate values, an ID to identify which coordinate system was used to define the coordinates and the type of this coordinate system. A representative dataframe is included below. I'm trying to do three things with the data in this dataframe,
The translation vectors and rotation matrices are available in another data structure that uses the coordinate system ID as key.
There can be 10s to 100s of millions of coordinates in the Dataframe so speed is very important so I'm thinking Polars Expressions would be appropriate.
I can see how to use GroupBy - first by CID_TYPE and then by CID.
On the GroupBy CID_TYPE, for coordinates with type 'C' (Cylindrical) I need to be able to update the first two coordinates using x=x(cos(y)), y=x(cos(y)).
On the GROUPBY CID, I need to be able to multiple each [X, Y, Z] coordinate with a 3x3 rotation matrix and add an offset vector.
I can easily see how to do this by iterating over the groups, extracting the group data into numpy arrays, updating the dat and then pushing back into the dataframe.
However I'm a Polars noob and and I suspect that there are faster ways to do this with Polars expressions, but I just can't figure them out...... How can I apply different expressions to different groupings?
Any help will be gratefully received.
Note that the real data has 10s of millions of rows. Typically the number of unique CIDs is very small in relation to the number of coordinates.
df = pl.DataFrame({
'ID' : [1, 2, 3, 4, 5, 6],
'CID' : [1, 2, 3, 9, 10, 11],
'CID_TYPE' : ['C', 'R', 'S', 'C', 'R', 'S'],
'COORDS' :[
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[10.0, 11.0, 12.0],
[13.0, 14.0, 15.0],
[16.0, 17.0, 18.0]
]
}
)
So I'd say I'm not sure a group_by
is needed here. Polars has an if-else construct called when
to operate separate expressions on column(s).
I ended up unnest
ing the COORD lists into separate X/Y/Z columns because that is easier to work with in polars (editing one element of a list at a time is a pain). So for part 1 of your question:
df = df.with_columns(
pl.col("COORDS").list.to_struct(fields=lambda i: "XYZ"[i])
).unnest("COORDS")
df.with_columns(
X=pl.when(CID_TYPE = "C")
.then(pl.col.X * pl.col.Y.cos())
.when(CID_TYPE = "S")
.then(pl.col.X * pl.col.Y.cos() * pl.col.Z.sin())
.otherwise(pl.col.X),
Y=pl.when(CID_TYPE = "C")
.then(pl.col.X * pl.col.Y.sin())
.when(CID_TYPE = "S")
.then(pl.col.X * pl.col.Y.sin() * pl.col.Z.sin())
.otherwise(pl.col.Y),
Z=pl.when(CID_TYPE = "S")
.then(pl.col.X * pl.col.Z.cos())
.otherwise(pl.col.Z),
)
shape: (6, 6)
┌─────┬─────┬──────────┬───────────┬───────────┬───────────┐
│ ID ┆ CID ┆ CID_TYPE ┆ X ┆ Y ┆ Z │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪══════════╪═══════════╪═══════════╪═══════════╡
│ 1 ┆ 1 ┆ C ┆ -0.416147 ┆ 0.909297 ┆ 3.0 │
│ 2 ┆ 2 ┆ R ┆ 4.0 ┆ 5.0 ┆ 6.0 │
│ 3 ┆ 3 ┆ S ┆ -0.419743 ┆ 2.85413 ┆ -6.377912 │
│ 4 ┆ 9 ┆ C ┆ 0.044257 ┆ -9.999902 ┆ 12.0 │
│ 5 ┆ 10 ┆ R ┆ 13.0 ┆ 14.0 ┆ 15.0 │
│ 6 ┆ 11 ┆ S ┆ 3.306307 ┆ 11.551956 ┆ 10.565067 │
└─────┴─────┴──────────┴───────────┴───────────┴───────────┘
For parts 2 to 3, I'd say joining your separate tables to this one on CID_TYPE
so each row has the right rotations & translations, is the way to go and then you can operate from there. I'm not sure about matrix multiplication in polars without a custom function, but there is dot
.