python pandas geolocation pyarrow

How to filter rows using a custom function in pyarrow


I have a parquet dataset that contains latitude and longitude values as separate columns, and I want to keep only the rows that fall inside a polygon. I can do this with a pandas DataFrame, but not with a pyarrow Table.

I'm using pyarrow to read the parquet files, as it's quite fast.

Here's how I'm doing this in pandas:

import pyarrow as pa
from pyarrow.parquet import ParquetDataset
from shapely.geometry import shape, Point

def point_in_polygon(df, polygon):
    # Row-wise test: True where the (lon, lat) point intersects the polygon
    return df.apply(lambda x: shape(polygon).intersects(Point(x.lon, x.lat)), axis=1)

res: pa.Table = ParquetDataset(....).read()
res.to_pandas().loc[lambda df: point_in_polygon(df, polygon)]

The problem with this approach is that it's quite slow. I know about filters in pyarrow and about pyarrow.compute, but I can't figure out how to use them for this.

If more information is needed please let me know :)
Thanks :)


Solution

  • It's slow because a Python function is called once per row, on scalar points. Shapely (2.0 and later) provides vectorized functions that operate on NumPy arrays, so the loop runs in compiled code instead of in Python. The key is to build an array of points first. Assuming you have arrays (NumPy or pyarrow) of lons and lats:

    import numpy as np
    import shapely
    points = shapely.from_ragged_array(shapely.GeometryType.POINT, np.array([lons, lats]).T)
    mask = shape(polygon).intersects(points)  # boolean array, one entry per point