Search code examples
pythonmemoryruntimevectorizationgeopandas

Need to compute distances between each of 60,000 coordinates


I am working on a correlative study in Python that needs a matrix of distances between each pair of coordinates in a data set with 60,000 data points. I have attempted vectorization, and using geopandas, but the problem with geopandas is that to run the distance function, I need the x and y data in repeating lists (the x data repeating the set of 60,000 towers and the y repeating each coordinate 60,000 times in a row) making each list 3.6e9 values long, and my computer runs out of memory before this is completed, or when I try to run it on a remote desktop at my school, it takes more than half an hour and I haven't been able to run it successfully. Here is the code I am running:

#Florida Tower Matrix 
#take coordinates of Florida towers
#CHECK THE LAT/LONG order 
import geojson
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
features = []
with open("/Users/katcn/Desktop/Spring 2024/Research/PleaseWork.geojson") as f:
   gj = geojson.load(f)

for i in range(59629):
   features.append(gj['features'][i]["geometry"]['coordinates'])


#OR make the X matrix all in one column 
#make the Y matrix repeat each value 59000 times 
longitude = []
latitude = []
for i in range(len(features)):
   for j in range(len(features)):
       longitude.append(features[j][0])
   for k in range(len(features)):
       latitude.append(features[i][0])


dict = {"longitude" : longitude, "latitude" : latitude}
df = pd.DataFrame(dict)
dict2 = {"longitude" : longitude, "latitude" : latitude}
df2 = pd.DataFrame(dict2)
#calculate distance between two towers 
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df, crs={'init': 'epsg:4326'}, geometry=geometry)

geometry2 = [Point(xy) for xy in zip(df2.longitude, df2.latitude)]
gdf2 = gpd.GeoDataFrame(df2, crs={'init': 'epsg:4326'}, geometry=geometry2)

distances = gdf.geometry.distance(gdf2.geometry)

print(distances)

Any suggestions for how to approach this problem differently to make it a more reasonable run-time would be great.


Solution

  • This answer is to solve the issue that immediately follows, namely saving the results you obtain to a csv-file. Given the shear size of the data, pd.to_csv() is not the way to go.

    Here, I build on the previous solution by computing the distance matrix, note that the matrix actually is sparse since the distance between xand y is the same as the distance betweeen y and x (so the matrix is triangular.

    I create the dataframe by adding the points (lat, long). Then, using polar I tranform the pandas dataframe to a polar dataframe and finally save it:

    import numpy as np
    import pandas as pd
    import time
    
    def haversine(lat1, lon1, lat2, lon2):
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a)) 
        r = 6371  
        return c * r
    
    np.random.seed(42)
    latitudes = np.random.uniform(low=25.0, high=49.0, size=10000)
    longitudes = np.random.uniform(low=-125.0, high=-66.0, size=10000)
    coords = np.column_stack((latitudes, longitudes))
    
    start_time = time.time()
    
    n = coords.shape[0]
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        dist_matrix[i, :] = haversine(coords[i, 0], coords[i, 1], coords[:, 0], coords[:, 1])
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Distance matrix computed in {:.2f} seconds".format(elapsed_time))
    print("Shape of the distance matrix:", dist_matrix.shape)
    
    point_labels = [f"Point {i}" for i in range(n)]
    dist_df = pd.DataFrame(dist_matrix, index=point_labels, columns=point_labels)
    
    
    distance = dist_df.loc['Point 0', 'Point 1']
    print(f"Distance from Point 0 to Point 1: {distance} km")
    
    lat_lon_index = pd.MultiIndex.from_arrays([latitudes, longitudes], names=['Latitude', 'Longitude'])
    dist_df.index = lat_lon_index
    dist_df.columns = lat_lon_index
    
    lat_from, lon_from = latitudes[0], longitudes[0]  
    lat_to, lon_to = latitudes[1], longitudes[1]
    distance = dist_df.loc[(lat_from, lon_from), (lat_to, lon_to)]
    print(f"Distance from ({lat_from}, {lon_from}) to ({lat_to}, {lon_to}): {distance} km")
    

    Now the dist_df looks like this

    enter image description here

    To save it to a csv just do this:

    dist_df_reset = dist_df.reset_index()
    
    polars_df = pl.from_pandas(dist_df_reset)
    
    polars_df.write_csv('polars_distance_matrix.csv')
    

    The file will be:

    enter image description here

    IMPORTANT NOTE: This will be much faster, but you cannot hope for it to take very little time with that many points