
Remove GPS points if haversine distance to previous point is less than a certain value


I have a pandas dataframe with GPS coordinates:

import numpy as np
import pandas as pd
d1 = {'user': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'], 
      'longitude': [-122.419576048851, -122.4196457862854, -122.41975843906403, -122.41981744766234, -122.41961896419524, -122.41846561431885, -122.41841197013854, -122.41860508918761, -122.41830468177795, -122.41655588150023, -122.416330575943, -122.41608381271362, -122.41587996482849, -122.41443157196045, -122.41400241851807, -122.4145495891571, -122.28513300418852, -122.28403329849243, -122.28397965431215, -122.28369534015657, -122.28364706039427, -122.28360414505003, -122.28335201740265, -122.28326618671417, -122.28309988975525, -122.2829818725586, -122.28216111660002,  -122.28297650814056], 
      'latitude':[37.77727010900716, 37.77759235026598, 37.778147789138536, 37.778291948163755, 37.77833010785869, 37.77846154665706, 37.77932225301237, 37.780250787054555, 37.78027198632572, 37.78056029581, 37.78059421449895, 37.78061965350541, 37.78064509250312, 37.780848604169755, 37.7822816496242, 37.784647385762014, 37.81233951943745, 37.812068286068886, 37.81228018722322, 37.81312354779044, 37.813237972853855, 37.813365111605194, 37.814017753748836, 37.8141830323372, 37.814161842795265, 37.81414489115734, 37.814009277913826, 37.81183095605405]}
df1 = pd.DataFrame(data=d1)

Using the following haversine function, I can calculate the distance in meters between consecutive points of each GPS trajectory (grouped per user):

# Define haversine function
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return pd.DataFrame(m)

df1['distance'] = df1.groupby('user').apply(lambda x: haversine(x['latitude'], 
                                                                x['longitude'],
                                                                x['latitude'].shift(), 
                                                                x['longitude'].shift())).values
df1['distance'] = df1['distance'].fillna(0)

    user    longitude   latitude    distance
0   A       -122.419576 37.777270   0.000000
1   A       -122.419646 37.777592   36.352012
2   A       -122.419758 37.778148   62.550525
3   A       -122.419817 37.778292   16.847806
4   A       -122.419619 37.778330   17.952766
5   A       -122.418466 37.778462   102.412611
6   A       -122.418412 37.779322   95.822233
7   A       -122.418605 37.780251   104.633961
8   A       -122.418305 37.780272   26.506241
9   A       -122.416556 37.780560   157.000401
10  A       -122.416331 37.780594   20.156826
11  A       -122.416084 37.780620   21.870313
12  A       -122.415880 37.780645   18.136963
13  A       -122.414432 37.780849   129.286601
14  A       -122.414002 37.782282   163.749922
15  A       -122.414550 37.784647   267.416687
16  B       -122.285133 37.812340   0.000000
17  B       -122.284033 37.812068   101.203952
18  B       -122.283980 37.812280   24.028959
19  B       -122.283695 37.813124   97.046376
20  B       -122.283647 37.813238   13.411732
21  B       -122.283604 37.813365   14.631208
22  B       -122.283352 37.814018   75.875008
23  B       -122.283266 37.814183   19.864639
24  B       -122.283100 37.814162   14.797045
25  B       -122.282982 37.814145   10.537113
26  B       -122.282161 37.814009   73.658945
27  B       -122.282977 37.811831   252.587420

Now I would like to write a function that removes a GPS point if its distance to its predecessor is less than 50 meters. The function should always keep the last point of the trajectory, regardless of its distance to the previously kept point. The first point should also always be kept.

Any ideas how to achieve this?


Solution

  • A solution that you could insert into a function is the following:

    You want to keep the first and last row for each user. With df being the dataframe from the question (called df1 there), this can be achieved by:

    g = df.groupby('user')
    df2 = pd.concat([g.head(1), g.tail(1)])
    

    which is

      user   longitude   latitude    distance
    0     A -122.419576  37.777270    0.000000
    16    B -122.285133  37.812340    0.000000
    15    A -122.414550  37.784647  267.416687
    27    B -122.282977  37.811831  252.587420
    

    Then drop every row whose distance is less than 50, concatenate the result with the first and last rows of each group, and sort by index:

    df = df.drop(df[df.distance< 50].index)
    df_new = pd.concat([df,df2]).sort_index()
    df_new = df_new.drop_duplicates()
    

    which gives

     user   longitude   latitude    distance
    0     A -122.419576  37.777270    0.000000
    2     A -122.419758  37.778148   62.550525
    5     A -122.418466  37.778462  102.412611
    6     A -122.418412  37.779322   95.822233
    7     A -122.418605  37.780251  104.633961
    9     A -122.416556  37.780560  157.000401
    13    A -122.414432  37.780849  129.286601
    14    A -122.414002  37.782282  163.749922
    15    A -122.414550  37.784647  267.416687
    16    B -122.285133  37.812340    0.000000
    17    B -122.284033  37.812068  101.203952
    19    B -122.283695  37.813124   97.046376
    22    B -122.283352  37.814018   75.875008
    26    B -122.282161  37.814009   73.658945
    27    B -122.282977  37.811831  252.587420
    
    

    Not the most beautiful function, but it works:

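    def Drop_values(df):
        # The first and last row of every user's trajectory are always kept
        g = df.groupby('user')
        df2 = pd.concat([g.head(1), g.tail(1)])
        # Drop rows that are closer than 50 m to the preceding point
        df = df.drop(df[df.distance < 50].index)
        # Re-add the first/last rows, restore the original order, remove repeats
        df_new = pd.concat([df, df2]).sort_index()
        df_new = df_new.drop_duplicates()
        return df_new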
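
One thing to be aware of: the function above filters on the precomputed distance to the immediately preceding point, so a run of small steps is removed entirely even if it adds up to more than 50 m. If the intent is instead to compare each point with the last point that was actually kept, a greedy loop per user is one possible approach. The following is only a minimal sketch under that assumption; the names haversine_m, thin_track and thin_tracks are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a)) * 1000


def thin_track(group, min_dist=50.0):
    """Keep a point only if it is at least min_dist meters from the last KEPT point."""
    lats = group['latitude'].to_numpy()
    lons = group['longitude'].to_numpy()
    keep = [0]                          # the first point is always kept
    for i in range(1, len(group)):
        j = keep[-1]                    # position of the last kept point
        if haversine_m(lats[j], lons[j], lats[i], lons[i]) >= min_dist:
            keep.append(i)
    if keep[-1] != len(group) - 1:      # the last point is always kept
        keep.append(len(group) - 1)
    return group.iloc[keep]


def thin_tracks(df, min_dist=50.0):
    """Apply thin_track to every user's trajectory and restore the row order."""
    parts = [thin_track(g, min_dist) for _, g in df.groupby('user', sort=False)]
    if not parts:                       # empty input: nothing to thin
        return df.copy()
    return pd.concat(parts).sort_index()


# Hypothetical demo data: user A moves north in steps of roughly 11-22 m,
# user B has only two points that are about 1 m apart.
demo = pd.DataFrame({
    'user': ['A'] * 6 + ['B'] * 2,
    'latitude': [0.0, 0.0002, 0.0004, 0.0006, 0.0007, 0.0008, 1.0, 1.00001],
    'longitude': [0.0] * 8,
})
result = thin_tracks(demo)
print(result)
```

In this demo every consecutive step of user A is shorter than 50 m, so filtering on the consecutive distance would leave only the first and last point of A. The greedy variant additionally keeps row 3, which is the first point lying more than 50 m from the start. Note that it loops in Python, so it will be slower than the vectorized version on long trajectories.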