python pandas distance latitude-longitude geopy

How to utilize .apply on functions from geopy in pandas when creating new column from existing columns


So I am trying to find a more efficient way of doing a task I already wrote code for. The goal is to use four columns of a pandas DataFrame (LATITUDE, LONGITUDE, LATITUDE_YORK_UNIVERSITY, LONGITUDE_YORK_UNIVERSITY) to create a new column holding the distance in kilometers between two coordinates. The first coordinate is (LATITUDE, LONGITUDE) and the second is (LATITUDE_YORK_UNIVERSITY, LONGITUDE_YORK_UNIVERSITY).

A link of what the table looks like

Right now I build a list with the following code (geopy and pandas iterrows), convert it into a column, and concatenate that to the DataFrame. This is cumbersome. I know there is an easier way using .apply with the geopy function, but I haven't been able to figure out the syntax.

from geopy.distance import geodesic as GD

# Collect one distance per row (named `distances` to avoid shadowing the built-in `list`)
distances = []
for index, row in result.iterrows():
    coordinate1 = (row['LATITUDE'], row['LONGITUDE'])
    coordinate2 = (row['LATITUDE_YORK_UNIVERSITY'], row['LONGITUDE_YORK_UNIVERSITY'])
    distances.append(GD(coordinate1, coordinate2).km)

Solution

  • TL;DR

    df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
    

    Some explanation

    Let's say we have a function that requires two tuples as arguments. For example:

    from math import dist
    
    def distance(point1: tuple, point2: tuple) -> float:
        
        # suppose that developer checks the type
        # so we can pass only tuples as arguments
        assert type(point1) is tuple
        assert type(point2) is tuple
    
        return dist(point1, point2)
    

    Let's apply the function to this data:

    import numpy as np
    import pandas as pd

    # 10 rows of dummy coordinates, 4 columns
    df = pd.DataFrame(
        data=np.arange(10*4).reshape(10, 4),
        columns=['long', 'lat', 'Y long', 'Y lat']
    )
    

    We pass two arguments to apply: axis=1 to iterate over rows, and a lambda function wrapping distance. To turn each half of the row into a tuple we can use tuple(...) or (*...,); note the trailing comma in the latter option:

    df.apply(lambda x: distance((*x[:2],), (*x[2:],)), axis=1)
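
As a small illustration of the two spellings mentioned above, here is a minimal sketch that uses only the standard library. The plain list `row` is a hypothetical stand-in for one DataFrame row, and `distance` repeats the tuple-checking function defined earlier.

```python
from math import dist


def distance(point1: tuple, point2: tuple) -> float:
    # Same tuple-only check as in the function above
    assert type(point1) is tuple
    assert type(point2) is tuple
    return dist(point1, point2)


# A plain list stands in for one DataFrame row (hypothetical values)
row = [0, 1, 2, 3]

with_call = tuple(row[:2])   # explicit conversion
with_star = (*row[:2],)      # unpacking into a tuple literal; the comma is required

# Both spellings build the same tuple
same = with_call == with_star

# (*row[:2]) without the trailing comma is a SyntaxError, not a tuple
d = distance((*row[:2],), (*row[2:],))
```

Either form satisfies the type check; which one to use is mostly a matter of taste.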
    

    In fact, geopy.distance doesn't require its arguments to be tuples; each can be any iterable with 2 to 3 elements (see how an argument is converted to the internal Point type when a distance is defined). So, with distance being a geopy distance function rather than the tuple-checking one above, this simplifies to:

    df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
    

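    geopy reads each point as (latitude, longitude), so it is safer not to rely on column positions. To make the call independent of the column order, select the columns by name (using the names from your question):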

    common_point = ['LATITUDE','LONGITUDE']
    york_point = ['LATITUDE_YORK_UNIVERSITY','LONGITUDE_YORK_UNIVERSITY']
    result.apply(lambda x: GD(x[common_point], x[york_point]).km, axis=1)
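
To actually create the new column, the Series returned by apply can be assigned back to the DataFrame. The sketch below is an illustration under stated assumptions, not the original answer's code: the sample rows and the column name `DISTANCE_KM` are hypothetical, and `haversine_km` is a great-circle stand-in used only so the example runs without geopy installed. With geopy available, `GD(row[common_point], row[york_point]).km` from the line above can be used in its place.

```python
import math

import pandas as pd


def haversine_km(point1, point2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees.

    Stand-in for geopy's geodesic(...).km (assumption: a spherical Earth
    with mean radius 6371.0088 km, so results differ slightly from geodesic).
    """
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


# Hypothetical sample rows; the real `result` frame comes from the question
result = pd.DataFrame({
    'LATITUDE': [43.6532, 43.7735],
    'LONGITUDE': [-79.3832, -79.5019],
    'LATITUDE_YORK_UNIVERSITY': [43.7735, 43.7735],
    'LONGITUDE_YORK_UNIVERSITY': [-79.5019, -79.5019],
})

common_point = ['LATITUDE', 'LONGITUDE']
york_point = ['LATITUDE_YORK_UNIVERSITY', 'LONGITUDE_YORK_UNIVERSITY']

# apply(..., axis=1) returns a Series aligned with result's index,
# so it can be assigned directly as a new column
result['DISTANCE_KM'] = result.apply(
    lambda row: haversine_km(row[common_point], row[york_point]),
    axis=1,
)
```

This replaces the list-building loop and the concatenation step with a single assignment.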