Matrix of distances from csv file of lat and lon


I have a CSV file with places and their latitudes and longitudes. I want to create a distance matrix from them. I tried creating the matrix using:

arr = df['latitude'].values - df['latitude'].values[:, None]
pd.concat((df['name'], pd.DataFrame(arr, columns=df['name'])), axis=1)

but it only creates a matrix of latitude differences, and I want to calculate the distance between places. The matrix I want is the matrix of distances between all of the hotels.


Solution

  • Based on the answer of @ravenspoint, here is some simple code to calculate the distances.

    >>> import numpy as np
    >>> import pandas as pd
    >>> import geopy.distance
    
    >>> data = {"hotels": ["1", "2", "3", "4"], "lat": [20.55697, 21.123698, 25.35487, 19.12577], "long": [17.1, 18.45893, 16.78214, 14.75498]}
    
    >>> df = pd.DataFrame(data)
    >>> df
    
      hotels        lat      long
    0      1  20.556970  17.10000
    1      2  21.123698  18.45893
    2      3  25.354870  16.78214
    3      4  19.125770  14.75498
    

    Now let's create a matrix to hold the distances between hotels. It should have shape (number of hotels x number of hotels).

    >>> matrix = np.ones((len(df), len(df))) * -1
    >>> np.fill_diagonal(matrix, 0)
    >>> matrix
    
    array([[ 0., -1., -1., -1.],
           [-1.,  0., -1., -1.],
           [-1., -1.,  0., -1.],
           [-1., -1., -1.,  0.]])
    

    Here -1 marks entries that have not been computed. Since dist(1,2) = dist(2,1), only the entries above the diagonal need to be filled, which avoids calculating the same distance twice.

    Next, loop over the hotels and calculate the distances. The geopy package is used here.

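    >>> for i in range(len(df)):
    ...     # geodesic() expects each point as (latitude, longitude)
    ...     coords_i = df.loc[i, ["lat", "long"]].values
    ...     # start at i+1 so that only entries above the diagonal are computed
    ...     for j in range(i+1, len(df)):
    ...         coords_j = df.loc[j, ["lat", "long"]].values
    ...         matrix[i,j] = geopy.distance.geodesic(coords_i, coords_j).km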
    
    >>> matrix
    
    array([[  0.        , 154.73003254, 532.33605633, 292.29813424],
           [ -1.        ,   0.        , 499.00500751, 445.97821702],
           [ -1.        ,  -1.        ,   0.        , 720.69054683],
           [ -1.        ,  -1.        ,  -1.        ,   0.        ]])
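
Because only the upper triangle is filled, the -1 placeholders remain below the diagonal. A minimal sketch of one way to turn this into the full symmetric matrix, assuming the values printed above (the names `upper`, `strict_upper` and `full` are illustrative only):

```python
import numpy as np

# Upper-triangle result copied from the output above; -1 marks "not computed".
upper = np.array([
    [0.0, 154.73003254, 532.33605633, 292.29813424],
    [-1.0, 0.0, 499.00500751, 445.97821702],
    [-1.0, -1.0, 0.0, 720.69054683],
    [-1.0, -1.0, -1.0, 0.0],
])

# Keep only the entries strictly above the diagonal (everything else becomes 0),
# then add the transpose so that full[j, i] == full[i, j].
strict_upper = np.triu(upper, k=1)
full = strict_upper + strict_upper.T

print(full)
```

To get the labelled table asked for in the question, the result could then be wrapped as `pd.DataFrame(full, index=df["hotels"], columns=df["hotels"])`.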
    

    Please note that the nested Python loop is not the most efficient way to do the job: it makes n(n-1)/2 separate geodesic calls for n hotels, so it slows down quickly as the number of hotels grows.
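
One possible enhancement is to replace the loop with NumPy broadcasting. The sketch below is not the geopy method used above: it swaps in the haversine (great-circle) formula on a sphere, so its results will differ slightly from the geodesic values, typically by well under one percent. The function name `haversine_matrix` and the constant `EARTH_RADIUS_KM` are assumptions made for illustration.

```python
import numpy as np

# Mean Earth radius in km; an assumption of the spherical (haversine) model.
EARTH_RADIUS_KM = 6371.0088


def haversine_matrix(lat_deg, lon_deg):
    """Return the n x n matrix of great-circle distances in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))

    # Broadcasting a column against a row gives all pairwise differences.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]

    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Same coordinates as in the example data above.
lat = [20.55697, 21.123698, 25.35487, 19.12577]
lon = [17.1, 18.45893, 16.78214, 14.75498]

dist = haversine_matrix(lat, lon)
print(dist.round(2))
```

This computes the full symmetric matrix in one pass, with zeros on the diagonal. If the exact geodesic values are required, the geopy loop remains the reference; the vectorized version trades a little accuracy for speed.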