
How to get the distance between two geographic coordinates of two different dataframes?


I am working on a project for university, where I have two pandas dataframes:

      # Libraries
      import pandas as pd
      from geopy import distance

      # Dataframes

      df1 = pd.DataFrame({'id': [1,2,3],                   
                          'lat':[-23.48, -22.94, -23.22],
                          'long':[-46.36, -45.40, -45.80]})

      df2 = pd.DataFrame({'id': [100,200,300],
                          'lat':[-28.48, -22.94, -23.22],
                          'long':[-46.36, -46.40, -45.80]})

I need to calculate the distance between every pair of coordinates (latitude, longitude) across the two dataframes, so I used geopy. If the distance from a row of df1 to any row of df2 is below a threshold of 100 meters, I must assign the value 1 to that row in the 'nearby' column. I wrote the following code:

      threshold = 100  # meters

      df1['nearby'] = 0

      for i in range(0, len(df1)):
          for j in range(0, len(df2)):

              coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
              coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])

              var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000 

              if(var_distance < threshold):
                   df1['nearby'].iloc[i] = 1

The code works, although a warning appears. However, I would like to find a way to avoid the nested for loops. Is that possible?

       # Output:

       id     lat    long  nearby
        1  -23.48  -46.36       0
        2  -22.94  -45.40       0
        3  -23.22  -45.80       1

Solution

  • If you can use scikit-learn, its haversine_distances function computes the pairwise great-circle distances between two sets of coordinates given in radians, so you get:

    import numpy as np
    from sklearn.metrics.pairwise import haversine_distances
    
    # distance threshold; change it as needed
    threshold = 100 # meters
    
    # mean Earth radius, used to scale the unit-sphere result
    earth_radius = 6371000  # meters
    
    df1['nearby'] = (
        # pairwise distances between every point of df1 and every point of df2
        haversine_distances(
            # the inputs must be [lat, long] in radians, hence np.radians
            X=np.radians(df1[['lat','long']].to_numpy()),
            Y=np.radians(df2[['lat','long']].to_numpy()))
        # scale the unit-sphere distances to meters
        *earth_radius
        # compare to your threshold
        < threshold
        # flag each row of df1 that has at least one point of df2 nearby
        ).any(axis=1).astype(int)
    
    print(df1)
    
    #    id    lat   long  nearby
    # 0   1 -23.48 -46.36       0
    # 1   2 -22.94 -45.40       0
    # 2   3 -23.22 -45.80       1
    

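    EDIT: OP asked for a version that uses distance from geopy, so here is one way. It removes the explicit for loops, but every pair is still evaluated in Python, so it is not vectorized.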

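    df1['nearby'] = (np.array(
        # geopy distance in km for every (df1 point, df2 point) pair
        [[(distance.distance(coord1, coord2).km)
          for coord2 in df2[['lat','long']].to_numpy()]
         for coord1 in df1[['lat','long']].to_numpy()]
         ) * 1000 < threshold  # convert km to meters, then compare
    ).any(axis=1).astype(int)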
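
For reference, the haversine distance used in the scikit-learn version can be written out by hand. The sketch below uses only the standard library and the same mean Earth radius as above. The names haversine_m, nearby_flags, points_df1 and points_df2 are illustrative and not part of the original answer. It assumes a spherical Earth, so results can differ slightly from geopy's distance.distance, which computes a geodesic on an ellipsoid.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (same value as above)


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # haversine formula: a is the squared half-chord length on the unit sphere
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def nearby_flags(points_a, points_b, threshold_m):
    """Return 1 for each point of points_a with any point of points_b closer than threshold_m."""
    return [
        int(any(haversine_m(lat_a, lon_a, lat_b, lon_b) < threshold_m
                for lat_b, lon_b in points_b))
        for lat_a, lon_a in points_a
    ]


# (lat, long) pairs copied from df1 and df2 in the question
points_df1 = [(-23.48, -46.36), (-22.94, -45.40), (-23.22, -45.80)]
points_df2 = [(-28.48, -46.36), (-22.94, -46.40), (-23.22, -45.80)]

flags = nearby_flags(points_df1, points_df2, threshold_m=100)
print(flags)  # [0, 0, 1]
```

Only the third point of df1 has a match within 100 meters (it coincides with the third point of df2), which agrees with the 'nearby' column shown above.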