I want to execute the following lines of python code in Polars as a UDF:
w = wkt.loads('POLYGON((-160.043334960938 70.6363054807905, -160.037841796875 70.6363054807905, -160.037841796875 70.6344840663086, -160.043334960938 70.6344840663086, -160.043334960938 70.6363054807905))')
polygon (optionally including holes).
j = shapely.geometry.mapping(w)
h3.polyfill(j, res=10, geo_json_conformant=True)
In pandas/geopandas:
import pandas as pd
import geopandas as gpd
import polars as pl
from shapely import wkt
pandas_df = pd.DataFrame({'quadkey': {0: '0022133222330023',
1: '0022133222330031',
2: '0022133222330100'},
'tile': {0: 'POLYGON((-160.043334960938 70.6363054807905, -160.037841796875 70.6363054807905, -160.037841796875 70.6344840663086, -160.043334960938 70.6344840663086, -160.043334960938 70.6363054807905))',
1: 'POLYGON((-160.032348632812 70.6381267305321, -160.02685546875 70.6381267305321, -160.02685546875 70.6363054807905, -160.032348632812 70.6363054807905, -160.032348632812 70.6381267305321))',
2: 'POLYGON((-160.02685546875 70.6417687358462, -160.021362304688 70.6417687358462, -160.021362304688 70.6399478155463, -160.02685546875 70.6399478155463, -160.02685546875 70.6417687358462))'},
'avg_d_kbps': {0: 15600, 1: 6790, 2: 9619},
'avg_u_kbps': {0: 14609, 1: 22363, 2: 15757},
'avg_lat_ms': {0: 168, 1: 68, 2: 92},
'tests': {0: 2, 1: 1, 2: 6},
'devices': {0: 1, 1: 1, 2: 1}}
)
# display(pandas_df)
gdf = pandas_df.copy()
gdf['geometry'] = gpd.GeoSeries.from_wkt(pandas_df['tile'])
import h3pandas
display(gdf.h3.polyfill_resample(10))
This works super quickly and easily. However, the polyfill function called from pandas apply as a UDF is too slow for the size of my dataset.
Instead, I would love to use polars but I run into several issues:
rying to move to polars for better performance
pl.from_pandas(gdf)
fails with: ArrowTypeError: Did not pass numpy.dtype object
it looks like geoarrow / geoparquet is not supported by polars
polars_df = pl.from_pandas(pandas_df)
out = polars_df.select(
[
gpd.GeoSeries.from_wkt(pl.col('tile')),
]
)
fails with:
TypeError: 'data' should be array of geometry objects. Use from_shapely, from_wkb, from_wkt functions to construct a GeometryArray.
polars_df.with_column(pl.col('tile').map(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)).alias('geometry'))
fails with:
Conversion of polars data type Utf8 to C-type not implemented.
this last option seems to be the most promising one (no special geospatial-type errors). But this generic error message of strings/Utf8 type for C not being implemented sounds very strange to me.
Furthermore:
polars_df.select(pl.col('tile').apply(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
works - but is lacking the other columns - i.e. syntax to manually select these is inconvenient. Though this is also failing when appending a:
.explode('tile').collect()
# InvalidOperationError: cannot explode dtype: Object("object")
To address some of your polars errors:
The wkt
functions can't handle a pl.Series
- you can use .to_numpy()
to provide a numpy array instead:
gpd.GeoSeries.from_wkt(polars_df.get_column("tile").to_numpy())
0 POLYGON ((-160.04333 70.63631, -160.03784 70.6...
1 POLYGON ((-160.03235 70.63813, -160.02686 70.6...
2 POLYGON ((-160.02686 70.64177, -160.02136 70.6...
dtype: geometry
You can use .with_columns()
instead of .select()
:
polars_df.with_columns(pl.col('tile').apply(lambda x: h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
shape: (3, 7)
┌──────────────────┬─────────────────────────────────────┬────────────┬────────────┬────────────┬───────┬─────────┐
│ quadkey | tile | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | object | i64 | i64 | i64 | i64 | i64 │
╞══════════════════╪═════════════════════════════════════╪════════════╪════════════╪════════════╪═══════╪═════════╡
│ 0022133222330023 | {'8a0d1c1306a7fff', '8a0d1c1306b... | 15600 | 14609 | 168 | 2 | 1 │
│ 0022133222330031 | {'8a0d1c130757fff', '8a0d1c13062... | 6790 | 22363 | 68 | 1 | 1 │
│ 0022133222330100 | {'8a0d1c1300d7fff', '8a0d1c1300c... | 9619 | 15757 | 92 | 6 | 1 │
└──────────────────┴─────────────────────────────────────┴────────────┴────────────┴────────────┴───────┴─────────┘
h3.polyfill()
is returning a python set object which polars does not really "recognize" as it stands.
You can convert the set to list()
and polars will give you a list[str]
column instead of object
- which you can .explode()
without error.
polars_df.with_columns(
pl.col('tile').apply(lambda x: list(h3.polyfill(shapely.geometry.mapping(wkt.loads(x)), res=10, geo_json_conformant=True)))
.alias('h3_polyfill')
).explode('h3_polyfill')
shape: (9, 8)
┌──────────────────┬─────────────────────────────────────┬────────────┬────────────┬────────────┬───────┬─────────┬─────────────────┐
│ quadkey | tile | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices | h3_polyfill │
│ --- | --- | --- | --- | --- | --- | --- | --- │
│ str | str | i64 | i64 | i64 | i64 | i64 | str │
╞══════════════════╪═════════════════════════════════════╪════════════╪════════════╪════════════╪═══════╪═════════╪═════════════════╡
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c1306a7fff │
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c1306b7fff │
│ 0022133222330023 | POLYGON((-160.043334960938 70.63... | 15600 | 14609 | 168 | 2 | 1 | 8a0d1c13079ffff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c130757fff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c130627fff │
│ 0022133222330031 | POLYGON((-160.032348632812 70.63... | 6790 | 22363 | 68 | 1 | 1 | 8a0d1c13070ffff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300d7fff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300c7fff │
│ 0022133222330100 | POLYGON((-160.02685546875 70.641... | 9619 | 15757 | 92 | 6 | 1 | 8a0d1c1300f7fff │
└──────────────────┴─────────────────────────────────────┴────────────┴────────────┴────────────┴───────┴─────────┴─────────────────┘
There's probably not much difference compared to doing the "all by hand" approach in pandas:
pandas_df['geometry'] = wkt.loads(pandas_df['tile'])
pandas_df = pandas_df.assign(
h3_polyfill=pandas_df['geometry'].map(lambda tile: h3.polyfill(shapely.geometry.mapping(tile), 10, True))
).explode('h3_polyfill')