Search code examples
pythonpython-3.xcomputational-geometrygeopandasshapely

Most efficient way to find polygon-coordinate intersections in large GeoJSON object


I'm working on a project that requires coordinate mappings - determining whether a coordinate point exists in a series of polygons. The number of mappings is quite large - ~10 million coordinates across 100+ million polygons.

Before I continue, I've already looked at the questions here and here. This question isn't redundant since it involves dynamic points and a static polygons.

I've reduced the project's scope for this question by mapping a single coordinate across a subset of 2 million polygons. This is the code I use:

from shapely.geometry import shape, Point

f = open('path/to/file.geojson', 'r')
data = json.loads(f.read())

point = Point(42.3847, -71.127411)
for feature in data['features']:
    polygon = shape(feature['geometry'])
    if polygon.contains(point):
        print(polygon)

Iterating through 2 million polygons, which in this case are building footprints, takes ~30 seconds (too much time).

I've also tried using mplPath as follows:

import matplotlib.path as mplPath

building_arrays = [np.array(data['features'][i]['geometry']['coordinates'][0])
                   for i, v in enumerate(tqdm(data['features']))]
bbPath_list = [mplPath.Path(building)
               for building in tqdm(building_arrays)]

for b in tqdm(bbPath_list):
    if b.contains_point((-71.1273842, 42.3847423)):
        print(b)

This takes ~6 seconds. An improvement, but still somewhat slow given the volume of mappings I need.

Is there any faster way to achieve such a mapping? I prefer not to use PySpark and distributed computing since I consider that a nuclear option, but I'm open to using that if need be. Is it perhaps possible to vectorize the calculation rather than iterating through the polygons? I'll produce an update showing whether using numba made any improvements.


Solution

  • I would use a spatial join.

    Given this fake data:

    enter image description here

    I'd join it with the "within" predicate:

    from shapely.geometry import Point, Polygon
    import geopandas
    
    polys = geopandas.GeoDataFrame({
        "name": ["foo", "bar"],
        "geometry": [
            Polygon([(5, 5), (5, 13), (13, 13), (13, 5)]),
            Polygon([(10, 10), (10, 15), (15, 15), (15, 10)]),
        ]
    })
    
    pnts = geopandas.GeoDataFrame({
        "pnt": ["A", "B", "C"],
        "geometry": [
            Point(3, 3), Point(8, 8), Point(11, 11)
        ]
    })
    
    result = geopandas.sjoin(pnts, polys, how='left', op='within')
    
    

    I get:

    pnt                  geometry  index_right name
      A   POINT (3.00000 3.00000)          NaN  NaN
      B   POINT (8.00000 8.00000)          0.0  foo
      C POINT (11.00000 11.00000)          0.0  foo
      C POINT (11.00000 11.00000)          1.0  bar