Search code examples
pythonpandasgeopy

python: recursively find the distance between points in a group


I can apply vincenty in geopy to my dataframe in pandas and determine the distance between the two consecutive machines. However, I want to find the distance between all the machines in the group without repeating.

For example, if I group by company name and there are 3 machines associated with this company, I would want to find the distance between machine 1 and 2, 1 and 3, and (2 and 3) but not calculate the distance between (2 and 1) and (3 and 1) since they are symmetric (identical results).

import pandas as pd
from geopy.distance import vincenty

df = pd.DataFrame({'ser_no': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
                'co_nm': ['aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'bb', 'cc', 'cc', 'cc'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'lon': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

coord_col = ['lat', 'lon']
matching_cust = df['co_nm'] == df['co_nm'].shift(1)
shift_coords = df.shift(1).loc[matching_cust, coord_col]
# join in shifted coords and compute distance
df_shift = df.join(shift_coords, how = 'inner', rsuffix = '_2')
# return distance in miles
df['dist'] = df_shift.apply(lambda x: vincenty((x[1], x[2]), 
    (x[4], x[5])).mi, axis = 1)

This only finds the distance of consecutive machines in the group how can I expand on this to find the distance of all machines in the group?

This code returns:

  co_nm  lat  lon  ser_no      dist
0    aa    1   21       1       NaN
1    aa    2   22       2  97.47832
2    aa    3   23       3  97.44923
3    bb    4   24       4       NaN
4    bb    5   25       5  97.34752
5    bb    6   26       6  97.27497
6    bb    7   27       7  97.18804
7    cc    8   28       8       NaN
8    cc    9   29       9  96.97129
9    cc   10   30       0  96.84163

Edit:

The desired output would find the unique distance combinations for machines related by company; that is, for co_nm aa we would have the distance between ser_no (1,2), (1,3), (2,3), (1,3) and the distance for the machines in co_nm bb and cc as well, but we wouldn't determine the distance of machines in different co_nm groups.

Does this make sense?


Solution

  • UPDATE2: using function:

    def calc_dist(df):
        return pd.DataFrame(
                   [ [grp,
                      df.loc[c[0]].ser_no,
                      df.loc[c[1]].ser_no,
                      vincenty(df.loc[c[0], ['lat','lon']], df.loc[c[1], ['lat','lon']])
                     ]
                     for grp,lst in df.groupby('co_nm').groups.items()
                     for c in combinations(lst, 2)
                   ],
                   columns=['co_nm','machineA','machineB','distance'])
    
    In [27]: calc_dist(df)
    Out[27]:
       co_nm  machineA  machineB               distance
    0     aa         1         2  156.87614939082016 km
    1     aa         1         3   313.7054454472326 km
    2     aa         2         3    156.829329105069 km
    3     cc         8         9  156.06016539095216 km
    4     cc         8         0   311.9109981692541 km
    5     cc         9         0  155.85149813446617 km
    6     bb         4         5  156.66564183673603 km
    7     bb         4         6   313.2143330250297 km
    8     bb         4         7   469.6225353388079 km
    9     bb         5         6  156.54889741438788 km
    10    bb         5         7  312.95759746593706 km
    11    bb         6         7   156.4089967703544 km
    

    UPDATE:

    In [9]: dist = pd.DataFrame(
       ...:   [ [grp,
       ...:      df.loc[c[0]].ser_no,
       ...:      df.loc[c[1]].ser_no,
       ...:      vincenty(df.loc[c[0], ['lat','lon']], df.loc[c[1], ['lat','lon']])
       ...:     ]
       ...:     for grp,lst in df.groupby('co_nm').groups.items()
       ...:     for c in combinations(lst, 2)
       ...:   ],
       ...:   columns=['co_nm','machineA','machineB','distance'])
    
    In [10]: dist
    Out[10]:
       co_nm  machineA  machineB               distance
    0     aa         1         2  156.87614939082016 km
    1     aa         1         3   313.7054454472326 km
    2     aa         2         3    156.829329105069 km
    3     cc         8         9  156.06016539095216 km
    4     cc         8         0   311.9109981692541 km
    5     cc         9         0  155.85149813446617 km
    6     bb         4         5  156.66564183673603 km
    7     bb         4         6   313.2143330250297 km
    8     bb         4         7   469.6225353388079 km
    9     bb         5         6  156.54889741438788 km
    10    bb         5         7  312.95759746593706 km
    11    bb         6         7   156.4089967703544 km
    

    Explanation: combination part

    In [11]: [c
       ....:  for grp,lst in df.groupby('co_nm').groups.items()
       ....:  for c in combinations(lst, 2)]
    Out[11]:
    [(0, 1),
     (0, 2),
     (1, 2),
     (7, 8),
     (7, 9),
     (8, 9),
     (3, 4),
     (3, 5),
     (3, 6),
     (4, 5),
     (4, 6),
     (5, 6)]
    

    OLD answer:

    In [3]: from itertools import combinations
    
    In [4]: import pandas as pd
    
    In [5]: from geopy.distance import vincenty
    
    In [6]: df = pd.DataFrame({'machine': [1,2,3], 'lat': [11, 12, 13], 'lon': [21,22,23]})
    
    In [7]: df
    Out[7]:
       lat  lon  machine
    0   11   21        1
    1   12   22        2
    2   13   23        3
    
    In [8]: dist = pd.DataFrame(
       ...:   [ [df.loc[c[0]].machine,
       ...:      df.loc[c[1]].machine,
       ...:      vincenty(df.loc[c[0], ['lat','lon']], df.loc[c[1], ['lat','lon']])
       ...:     ]
       ...:     for c in combinations(df.index, 2)
       ...:   ],
       ...:   columns=['machineA','machineB','distance'])
    
    In [9]: dist
    Out[9]:
       machineA  machineB               distance
    0         1         2   155.3664523771998 km
    1         1         3   310.4557192973811 km
    2         2         3  155.09044419651156 km