I'm working on a project that requires coordinate mappings - determining whether a coordinate point exists in a series of polygons. The number of mappings is quite large - ~10 million coordinates across 100+ million polygons.
Before I continue, I've already looked at the questions here and here. This question isn't redundant since it involves dynamic points and a static polygons.
I've reduced the project's scope for this question by mapping a single coordinate across a subset of 2 million polygons. This is the code I use:
from shapely.geometry import shape, Point
f = open('path/to/file.geojson', 'r')
data = json.loads(f.read())
point = Point(42.3847, -71.127411)
for feature in data['features']:
polygon = shape(feature['geometry'])
if polygon.contains(point):
print(polygon)
Iterating through 2 million polygons, which in this case are building footprints, takes ~30 seconds (too much time).
I've also tried using mplPath
as follows:
import matplotlib.path as mplPath
building_arrays = [np.array(data['features'][i]['geometry']['coordinates'][0])
for i, v in enumerate(tqdm(data['features']))]
bbPath_list = [mplPath.Path(building)
for building in tqdm(building_arrays)]
for b in tqdm(bbPath_list):
if b.contains_point((-71.1273842, 42.3847423)):
print(b)
This takes ~6 seconds. An improvement, but still somewhat slow given the volume of mappings I need.
Is there any faster way to achieve such a mapping? I prefer not to use PySpark and distributed computing since I consider that a nuclear option, but I'm open to using that if need be. Is it perhaps possible to vectorize the calculation rather than iterating through the polygons? I'll produce an update showing whether using numba
made any improvements.
I would use a spatial join.
Given this fake data:
I'd join it with the "within" predicate:
from shapely.geometry import Point, Polygon
import geopandas
polys = geopandas.GeoDataFrame({
"name": ["foo", "bar"],
"geometry": [
Polygon([(5, 5), (5, 13), (13, 13), (13, 5)]),
Polygon([(10, 10), (10, 15), (15, 15), (15, 10)]),
]
})
pnts = geopandas.GeoDataFrame({
"pnt": ["A", "B", "C"],
"geometry": [
Point(3, 3), Point(8, 8), Point(11, 11)
]
})
result = geopandas.sjoin(pnts, polys, how='left', op='within')
I get:
pnt geometry index_right name
A POINT (3.00000 3.00000) NaN NaN
B POINT (8.00000 8.00000) 0.0 foo
C POINT (11.00000 11.00000) 0.0 foo
C POINT (11.00000 11.00000) 1.0 bar