python, pandas, geopandas, shapely

How to speed up creating Point GeoSeries with large data?


I have two 1D arrays and want to combine them into one Point GeoSeries like this:

import numpy as np
from geopandas import GeoSeries
from shapely.geometry import Point

x = np.random.rand(int(1e6))
y = np.random.rand(int(1e6))
GeoSeries(map(Point, zip(x, y)))

It takes about 5 seconds on my laptop. Is there a faster way to build the GeoSeries?


Solution

  • Rather than calling Point once per coordinate pair through map, use a vectorized operation. GeoPandas provides the points_from_xy function, which builds all the points from the coordinate arrays in a single call. Here's an example run on my machine:

    import time

    import numpy as np
    import geopandas as gpd
    from shapely.geometry import Point

    x = np.random.rand(int(1e6))
    y = np.random.rand(int(1e6))

    # Baseline: build one Point at a time in a Python-level loop.
    start = time.perf_counter()
    gpd.GeoSeries(map(Point, zip(x, y)))
    print("time elapsed with `map` : ", time.perf_counter() - start)

    # Vectorized: build all points from the two arrays in one call.
    start = time.perf_counter()
    geo_series = gpd.GeoSeries(gpd.points_from_xy(x, y))
    print("time elapsed with `points_from_xy` : ", time.perf_counter() - start)
    

    Output:

    time elapsed with `map` :  9.318699359893799
    time elapsed with `points_from_xy` :  0.654371976852417
    

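    As the output shows, points_from_xy is roughly 14 times faster in this run (about 0.65 s versus 9.3 s), because it constructs the points with a vectorized operation instead of a Python-level loop.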

    See the geopandas.points_from_xy documentation to learn more: https://geopandas.org/en/stable/docs/reference/api/geopandas.points_from_xy.html
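
Before switching approaches, it may be worth confirming that both produce the same geometries. The following is a minimal sketch, not part of the original answer: the sample size of 1000, the fixed seed, and the variable names `slow`, `fast`, and `same` are assumptions chosen only to keep the check quick and repeatable.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

# Small, seeded sample so the check runs quickly (sizes are illustrative).
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

# Element-by-element construction, equivalent to the approach in the question.
slow = gpd.GeoSeries([Point(xy) for xy in zip(x, y)])

# Vectorized construction with points_from_xy.
fast = gpd.GeoSeries(gpd.points_from_xy(x, y))

# Compare the two series point by point; both share the default integer index.
same = bool(slow.geom_equals(fast).all())
print(same)
```

If the comparison holds on a small sample, the vectorized call can replace the map-based one on the full arrays without changing the result.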