Dropping Duplicate Points


I have two geodataframes (or geoseries), each consisting of thousands of points.

My requirement is to append (merge) both geodataframes and drop duplicate points.

In other words, output = all points of gdf1 + the points of gdf2 that do not intersect any gdf1 point.

I tried:

output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')

However, it is very slow.

Is there a faster way of doing it?


Solution

  • Here is another way of combining the dataframes using plain pandas, with timings compared against geopandas:

    import pandas as pd
    import numpy as np
    
    # Random integer coordinates; the small value range guarantees many duplicate points
    data1 = np.random.randint(-100, 100, size=10000)
    data2 = np.random.randint(-100, 100, size=10000)
    
    # Each point has longitude = -value and latitude = value
    df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
    df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
    
    df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
    df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
    
    # Index both frames by coordinates so points can be compared via the index
    df1 = df1.set_index(["longitude", "latitude"])
    df2 = df2.set_index(["longitude", "latitude"])
    
    # %timeit is an IPython/Jupyter magic; this times the symmetric difference on the index
    %timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
        
    112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    This seems a lot faster than the equivalent geopandas overlay:

    import geopandas as gp
    
    gdf1 = gp.GeoDataFrame(
        df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"), df1.index.get_level_values("latitude")))
    gdf2 = gp.GeoDataFrame(
        df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"), df2.index.get_level_values("latitude")))
    
    %timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
    
    29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    But maybe you need some kind of optimisations as mentioned here.

    The pd.concat expression keeps the rows of each dataframe whose index does not appear in the other one and then concatenates them, so values present in both are dropped entirely:

    df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
    df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
    pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
    
    col1
    1
    2
    5
    6
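
Note that the expression above is a symmetric difference (like the overlay call in the question), so shared points disappear from the result. If the goal is the one stated in the question, all points of gdf1 plus only the new points of gdf2, a variant along the following lines may be closer. This is a minimal sketch in plain pandas: `append_unique` is a hypothetical helper name, it assumes both frames are indexed by their coordinates as in the answer, and it compares coordinates for exact equality with no tolerance.

```python
import pandas as pd


def append_unique(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: all rows of df1 plus rows of df2 whose index is not in df1."""
    # Keep every row of df1 and only those rows of df2 that are new
    combined = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    # Also drop repeats that occur within a single frame, keeping the first occurrence
    return combined[~combined.index.duplicated(keep="first")]


# Same toy data as in the example above
df1 = pd.DataFrame({"col1": [1, 2, 3, 4]}).set_index("col1")
df2 = pd.DataFrame({"col1": [3, 4, 5, 6]}).set_index("col1")

result = append_unique(df1, df2)
print(result.index.tolist())  # [1, 2, 3, 4, 5, 6]
```

The same function should work with the (longitude, latitude) MultiIndex built earlier. For a GeoDataFrame of points, the coordinate index could presumably be derived from `geometry.x` and `geometry.y` before calling it.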