Search code examples
pythonpandasdatasetcoordinateshaversine

Iterate through pandas groups of coords and calculate distances


I have a csv dataset that looks like this:

    created_date,latitude,longitude
"2018-10-02 16:52:54",20.56314546,-100.40871983
"2018-10-07 18:06:37",20.56899227,-100.40879701
"2018-10-08 11:55:31",20.57479211,-100.39687493
"2018-10-08 11:55:31",20.58076244,-100.36075875
"2018-10-08 11:55:31",20.60529101,-100.40951731
"2018-10-08 11:55:31",20.60783806,-100.37852743
"2018-10-09 18:10:00",20.61098901,-100.38008197
"2018-10-09 18:10:00",20.61148848,-100.40851908
"2018-10-09 18:10:00",20.61327334,-100.34415272
"2018-10-09 18:10:00",20.61397514,-100.33583425

I am trying to use pandas to separate the data into groups by date and would then like to iterate through every group and calculate the distance between the lat,longs in every group using the haversine function which takes 2 coords as params.

In order to do so i have to calculate the distance of say coord1 with coord2, coord 2 with coord 3 and so on (from the group)

I want to do this in order to calculate the average distance traveled. I'd then have to add the distances together and divide it by the number of groups.

With pandas i managed to divide my data into groups but im not sure how to iterate through these groups, while excluding the groups (say "2018-10-02 16:52:54" ) which doesn't have 2 coords to calculate distance from.

My current python script looks like this:

col_names = ['date', 'latitude', 'longitude']
data = pd.read_csv('dataset.csv', names=col_names, sep=',', skiprows=1)
grouped = data.groupby('date')
for index, item in grouped:

Any guidance is appreciated, i have a general idea of how to do it but i'm not sure if tools like zip can help me through this.


Solution

  • Here's one option. It involves performing an enormous merge within groups, giving all pairwise combinations. Then remove all same row merges and you can calculate the distances once.

    import pandas as pd
    import numpy as np
    
    def haversine(lon1, lat1, lon2, lat2):
        # convert degrees to radians 
        lon1 = np.deg2rad(lon1)
        lat1 = np.deg2rad(lat1)
        lon2 = np.deg2rad(lon2)
        lat2 = np.deg2rad(lat2)
    
        # formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a)) 
        r_e = 6371 
        return c * r_e
    

    Code:

    # merge
    m = df.reset_index().merge(df.reset_index(), on='created_date')
    
    # remove comparisons of the same event
    m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])
    
    # Calculate Distance
    m['Distance'] = haversine(m.longitude_x, m.latitude_x, m.longitude_y, m.latitude_y)
    

    Output: m

               created_date  latitude_x  longitude_x  latitude_y  longitude_y  Distance
    3   2018-10-08 11:55:31   20.574792  -100.396875   20.580762  -100.360759  3.817865
    4   2018-10-08 11:55:31   20.574792  -100.396875   20.605291  -100.409517  3.637698
    5   2018-10-08 11:55:31   20.574792  -100.396875   20.607838  -100.378527  4.141211
    ...
    30  2018-10-09 18:10:00   20.613975  -100.335834   20.610989  -100.380082  4.617105
    31  2018-10-09 18:10:00   20.613975  -100.335834   20.611488  -100.408519  7.569825
    32  2018-10-09 18:10:00   20.613975  -100.335834   20.613273  -100.344153  0.869261
    

    To get the average per date:

    m.groupby('created_date').Distance.mean()
    
    #created_date
    #2018-10-08 11:55:31    4.021623
    #2018-10-09 18:10:00    4.411060
    #Name: Distance, dtype: float64
    

    As we subset the merged DataFrame before, this will only provide output for created_dates with more than 1 measurement.


    To merge on date instead of an exact time:

    df['created_date'] = pd.to_datetime(df.created_date)
    df['ng'] = df.groupby(df.created_date.dt.date).ngroup()
    
    m = df.reset_index().merge(df.reset_index(), on='ng')
    m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])
    
    ...