
Downsampling GPS data using the haversine formula in Python


I have high-frequency GPS data which I want to downsample to one point every 50 meters, i.e. keep the GPS latitude and longitude every 50 meters and discard the in-between points. I found Python code on the internet which calculates the distance between two points, but I am not sure how to read the lat and long values from a CSV, feed them into the function, and calculate the distance. If the distance reaches 50 meters, I simply save those GPS coordinates. So far, I have the following Python code:

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

x1 = 52.19421607   # latitude of point 1
x2 = 52.20000327   # latitude of point 2
y1 = -1.484984011  # longitude of point 1
y2 = -1.48533465   # longitude of point 2
result = haversine(y1, x1, y2, x2)    # the function takes (lon, lat, lon, lat); need to give input from a csv
# if result is greater than 0.05 (the function returns km, so 50 m = 0.05 km), save the coordinates
print(result)
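
Conceptually I think I need a loop like the untested sketch below (assuming a CSV with "lat" and "long" columns; the file name is made up):

import csv

kept = []
last = None                                   # last kept (lat, lon)
with open("gps_points.csv") as f:             # made-up file name
    for row in csv.DictReader(f):
        lat, lon = float(row["lat"]), float(row["long"])
        # haversine() returns kilometers, so 0.05 km = 50 m
        if last is None or haversine(last[1], last[0], lon, lat) >= 0.05:
            kept.append((lat, lon))
            last = (lat, lon)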

How can I solve this problem? Any direction would be appreciated.


Solution

  • Here is an outline and a working code example - where I made some assumptions about which points to keep or drop. I assume the dataframe is sorted.

    1. First, calculate the distance to the next point; use haversine for lat/long pairs. This part is not fast in my implementation - you can find faster versions.
    2. Use cumsum() of the distances to create distance groups, where group 0 covers cumulative distance below 50, group 1 between 50 and 100, and so on.
    3. Within each group, keep for instance only the first() point.

    Note that this keeps roughly one point per 50 units based on the cumulative-distance groups, which is different from taking a point, jumping to the next point that is closest to 50 units away, and repeating. But for data reduction purposes it should be fine.

    Generate some random data around London.

    import numpy as np
    import pandas as pd
    
    LONDON =  (51.509865, -0.118092)
    
    random_gps = np.random.random( (10000,2) ) / 25              # small random jitter
    random_gps[:,0] += np.arange(random_gps.shape[0]) / 25       # spread the points out along a track
    
    random_gps[:,0] += LONDON[0]
    random_gps[:,1] += LONDON[1]
    
    gps_data = pd.DataFrame( random_gps, columns=["lat","long"] )
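
    If your real points live in a CSV, as in the question, you could load them into the same shape instead (the file and column names here are assumptions):

    gps_data = pd.read_csv("gps_points.csv")[["lat", "long"]]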
    

    Shift the data to get the lat/long of the next point

    gps_data['next_lat'] = gps_data.lat.shift(-1).fillna(gps_data.lat)       # shift(-1) pulls the next row up;
    gps_data['next_long'] = gps_data.long.shift(-1).fillna(gps_data.long)    # the last point pairs with itself (distance 0)
    
    gps_data.head()
    

    Define the distance metric. This part can be made faster with vectorized numpy expressions, so if speed is important, change this part (see the sketch after the apply step below).

    from sklearn.neighbors import DistanceMetric    # newer scikit-learn: from sklearn.metrics import DistanceMetric
    
    dist = DistanceMetric.get_metric('haversine')
    
    EARTH_RADIUS = 6371.009
    
    def haversine_distance(row):
        point_a = np.array([[row.lat, row.long]])
        point_b = np.array([[row.next_lat, row.next_long]])
        # pairwise() returns a 1x1 matrix of angular distances; scale by the earth radius
        return EARTH_RADIUS * dist.pairwise(np.radians(point_a), np.radians(point_b))[0][0]
        
    

    And apply our distance function (the slow part, which can be improved):

    gps_data["distance_to_next"] = gps_data.apply( haversine_distance, axis=1)
    gps_data["distance_cumsum"] = gps_data.distance_to_next.cumsum()
    

    Finally, create the groups and drop. Note (!) that haversine returns the distance in kilometers, so for 50 meters divide by 0.05, not 50:

    gps_data["distance_group"] = gps_data.distance_cumsum // 50
    
    filtered = gps_data.groupby(['distance_group']).first()
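
    filtered now holds roughly one row per 50 m of traveled distance, with distance_group as the index. A small usage sketch to get a plain lat/long frame back and check the reduction:

    filtered = filtered.reset_index(drop=True)[["lat", "long"]]
    print(len(gps_data), "points reduced to", len(filtered))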