
How do I calculate a large distance matrix with the haversine library in Python?


I have a small set and a large set of locations and I need to know the geographic distance between the locations in these sets. An example of my datasets (they have the same structure, but one is larger):

     location        lat      long
0      Gieten  53.003312  6.763908
1    Godlinze  53.372605  6.814674
2  Grijpskerk  53.263894  6.306134
3   Groningen  53.219065  6.568008

In order to calculate the distances, I am using the haversine library. The haversine function wants the input to look like this:

lyon = (45.7597, 4.8422) # (lat, lon)
london = (51.509865, -0.118092)
paris = (48.8567, 2.3508)
new_york = (40.7033962, -74.2351462)

haversine_vector([lyon, london], [paris, new_york], Unit.KILOMETERS, comb=True)

after which the output looks like this:

array([[ 392.21725956,  343.37455271],
      [6163.43638211, 5586.48447423]])

How do I get the function to calculate a distance matrix with my two datasets without adding all the locations separately? I have tried using dictionaries and I have tried looping over the locations in both datasets, but I can't seem to figure it out. I am pretty new to Python, so I would prefer a solution that is easy to understand, even if it is not very elegant, over lambda functions and such. Thanks!


Solution

  • You are on the right track using haversine.haversine_vector.

    Since I'm not sure how you got your dataset, this is a self-contained example using CSV datasets. As long as you end up with a list of city names and a matching list of (lat, lon) tuples for each dataset, the same approach should work.

    Note that this does not compute distances between cities in the same array (e.g. not Helsinki <-> Turku). If you want that too, you could concatenate your two datasets into one list and pass that list as both arguments to haversine_vector, again with comb=True.

    import csv
    
    import haversine
    
    
    def read_csv_data(csv_data):
        cities = []
        locations = []
        for (city, lat, lng) in csv.reader(csv_data.strip().splitlines(), delimiter=";"):
            cities.append(city)
            locations.append((float(lat), float(lng)))
        return cities, locations
    
    
    cities1, locations1 = read_csv_data(
        """
    Gieten;53.003312;6.763908
    Godlinze;53.372605;6.814674
    Grijpskerk;53.263894;6.306134
    Groningen;53.219065;6.568008
    """
    )
    
    cities2, locations2 = read_csv_data(
        """
    Turku;60.45;22.266667
    Helsinki;60.170833;24.9375
    """
    )
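    
    # With comb=True, every location in the first list is paired with every
    # location in the second list; the default unit is kilometers.
    distance_matrix = haversine.haversine_vector(locations1, locations2, comb=True)
    distances = {}
    
    # Rows of the result follow locations2 and columns follow locations1,
    # so the matrix is indexed as [y, x].
    for y, city2 in enumerate(cities2):
        for x, city1 in enumerate(cities1):
            distances[city1, city2] = distance_matrix[y, x]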
    
    print(distances)
    

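    This prints out something like the following (reformatted here for readability; the actual print puts everything on one line):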

    {
        ("Gieten", "Turku"): 1251.501257597515,
        ("Godlinze", "Turku"): 1219.2012174066822,
        ("Grijpskerk", "Turku"): 1251.3232414412073,
        ("Groningen", "Turku"): 1242.8700308545722,
        ("Gieten", "Helsinki"): 1361.4575055586013,
        ("Godlinze", "Helsinki"): 1331.2811273683897,
        ("Grijpskerk", "Helsinki"): 1364.5464743878606,
        ("Groningen", "Helsinki"): 1354.8847270142198,
    }
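
Since you asked for something easy to follow rather than elegant, here is a minimal sketch of the same pairing done with plain nested loops and only the standard library. It is not how the haversine library works internally; `haversine_km`, `EARTH_RADIUS_KM`, `small_set` and `large_set` are names made up for this illustration, and the Earth radius value is an assumption of the sketch. It can be handy for checking a few entries of the matrix by hand, but for large datasets haversine_vector will be much faster.

```python
import math

# Mean Earth radius in kilometers. This value is an assumption of this sketch;
# the haversine library's own constant may differ slightly.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(point1, point2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical containers: city name -> (lat, lon), same data as above.
large_set = {
    "Gieten": (53.003312, 6.763908),
    "Godlinze": (53.372605, 6.814674),
    "Grijpskerk": (53.263894, 6.306134),
    "Groningen": (53.219065, 6.568008),
}
small_set = {
    "Turku": (60.45, 22.266667),
    "Helsinki": (60.170833, 24.9375),
}

# One distance per (city from the large set, city from the small set) pair.
distances = {}
for name1, point1 in large_set.items():
    for name2, point2 in small_set.items():
        distances[name1, name2] = haversine_km(point1, point2)

for pair, km in distances.items():
    print(pair, round(km, 1))
```

The resulting dictionary has the same keys as the one built from the matrix above, so the two can be compared entry by entry.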