Search code examples
pythonperformancenumpypandasvectorization

Vectorizing Haversine distance calculation in Python


I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinate pairs to produce the distance:

def haversine(point1, point2, miles=False):
    """ Calculate the great-circle distance bewteen two points on the Earth surface.

    :input: two 2-tuples, containing the latitude and longitude of each point
    in decimal degrees.

    Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))

    :output: Returns the distance bewteen the two points.
    The default unit is kilometers. Miles can be returned
    if the ``miles`` parameter is set to True.

    """

I can calculate the distance between all points using a nested for loop as follows:

data.head()

   id                      coordinates
0   1   (16.3457688674, 6.30354512503)
1   2    (12.494749307, 28.6263955635)
2   3    (27.794615136, 60.0324947881)
3   4   (44.4269923769, 110.114216113)
4   5  (-69.8540884125, 87.9468778773)

using a simple function:

distance = {}
def haver_loop(df):
    for i, point1 in df.iterrows():
        distance[i] = []
        for j, point2 in df.iterrows():
            distance[i].append(haversine(point1.coordinates, point2.coordinates))

    return pd.DataFrame.from_dict(distance, orient='index')

But this takes quite a while given the time complexity, running at around 20s for 500 points and I have a much longer list. This has me looking at vectorization, and I've come across numpy.vectorize ((docs), but can't figure out how to apply it in this context.


Solution

  • You would provide your function as an argument to np.vectorize(), and could then use it as an argument to pandas.groupby.apply as illustrated below:

    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
    

    For instance, with sample data as follows:

    length = 500
    df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
    

    compare for 500 points:

    def haver_vect(data):
        distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
        return distance
    
    %timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
    
    %timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop