I am working on a correlative study in Python that needs a matrix of distances between each pair of coordinates in a data set of 60,000 points. I have attempted vectorization and geopandas, but the problem with geopandas is that its distance function needs the x and y data in repeating lists (the x list repeating the full set of 60,000 towers, and the y list repeating each coordinate 60,000 times in a row), making each list 3.6e9 values long. My computer runs out of memory before this completes, and when I try to run it on a remote desktop at my school it takes more than half an hour; I haven't been able to run it successfully. Here is the code I am running:
#Florida Tower Matrix
#take coordinates of Florida towers
#CHECK THE LAT/LONG order
import geojson
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

features = []
with open("/Users/katcn/Desktop/Spring 2024/Research/PleaseWork.geojson") as f:
    gj = geojson.load(f)
for i in range(59629):
    features.append(gj['features'][i]["geometry"]['coordinates'])

#OR make the X matrix all in one column
#make the Y matrix repeat each value 59000 times
longitude = []
latitude = []
for i in range(len(features)):
    for j in range(len(features)):
        longitude.append(features[j][0])
    for k in range(len(features)):
        latitude.append(features[i][0])

dict = {"longitude" : longitude, "latitude" : latitude}
df = pd.DataFrame(dict)
dict2 = {"longitude" : longitude, "latitude" : latitude}
df2 = pd.DataFrame(dict2)

#calculate distance between two towers
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df, crs={'init': 'epsg:4326'}, geometry=geometry)
geometry2 = [Point(xy) for xy in zip(df2.longitude, df2.latitude)]
gdf2 = gpd.GeoDataFrame(df2, crs={'init': 'epsg:4326'}, geometry=geometry2)

distances = gdf.geometry.distance(gdf2.geometry)
print(distances)
Any suggestions for approaching this problem differently to get a more reasonable run-time would be great.
This answer addresses the issue that immediately follows, namely saving the results you obtain to a CSV file. Given the sheer size of the data, pd.to_csv() is not the way to go.
Here, I build on the previous solution by computing the distance matrix. Note that the matrix is symmetric, since the distance between x and y is the same as the distance between y and x (so only the triangular part actually needs to be stored). I create the dataframe by indexing on the points (lat, long). Then, using Polars, I transform the pandas DataFrame into a Polars DataFrame and finally save it:
import numpy as np
import pandas as pd
import time
def haversine(lat1, lon1, lat2, lon2):
    # convert decimal degrees to radians, then apply the haversine formula
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean Earth radius in km
    return c * r
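As a quick sanity check (not in the original answer, and repeating the haversine definition above so it runs on its own): a point should be at distance 0 from itself, and one degree of latitude at constant longitude should come out to πR/180 ≈ 111.19 km with R = 6371 km.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# a point is at distance 0 from itself
print(haversine(28.0, -81.0, 28.0, -81.0))  # 0.0
# one degree of latitude is pi * 6371 / 180 ~ 111.19 km
print(haversine(28.0, -81.0, 29.0, -81.0))
```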
np.random.seed(42)
latitudes = np.random.uniform(low=25.0, high=49.0, size=10000)
longitudes = np.random.uniform(low=-125.0, high=-66.0, size=10000)
coords = np.column_stack((latitudes, longitudes))
start_time = time.time()
n = coords.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    dist_matrix[i, :] = haversine(coords[i, 0], coords[i, 1], coords[:, 0], coords[:, 1])
end_time = time.time()
elapsed_time = end_time - start_time
print("Distance matrix computed in {:.2f} seconds".format(elapsed_time))
print("Shape of the distance matrix:", dist_matrix.shape)
point_labels = [f"Point {i}" for i in range(n)]
dist_df = pd.DataFrame(dist_matrix, index=point_labels, columns=point_labels)
distance = dist_df.loc['Point 0', 'Point 1']
print(f"Distance from Point 0 to Point 1: {distance} km")
lat_lon_index = pd.MultiIndex.from_arrays([latitudes, longitudes], names=['Latitude', 'Longitude'])
dist_df.index = lat_lon_index
dist_df.columns = lat_lon_index
lat_from, lon_from = latitudes[0], longitudes[0]
lat_to, lon_to = latitudes[1], longitudes[1]
distance = dist_df.loc[(lat_from, lon_from), (lat_to, lon_to)]
print(f"Distance from ({lat_from}, {lon_from}) to ({lat_to}, {lon_to}): {distance} km")
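The symmetry noted above can also be exploited to roughly halve the arithmetic: compute only the entries to the right of the diagonal and mirror them. A small self-contained sketch (not part of the original answer; 500 random points stand in for the real data):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(42)
lats = rng.uniform(25.0, 49.0, size=500)
lons = rng.uniform(-125.0, -66.0, size=500)

n = lats.size
dist = np.zeros((n, n))
for i in range(n):
    # fill only the entries to the right of the diagonal
    dist[i, i+1:] = haversine(lats[i], lons[i], lats[i+1:], lons[i+1:])
dist += dist.T  # mirror the upper triangle; the diagonal stays 0

# spot-check against a direct evaluation
assert np.isclose(dist[3, 7], haversine(lats[3], lons[3], lats[7], lons[7]))
```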
Now dist_df is a 10000 × 10000 matrix whose rows and columns are both indexed by (Latitude, Longitude) pairs (screenshot omitted here). To save it to a CSV, just do this:
import polars as pl

dist_df_reset = dist_df.reset_index()
polars_df = pl.from_pandas(dist_df_reset)
polars_df.write_csv('polars_distance_matrix.csv')
The written file is the same matrix with the (Latitude, Longitude) index restored as ordinary columns (screenshot omitted here).

IMPORTANT NOTE: This will be much faster, but you cannot hope for it to take very little time with that many points: a full 60,000 × 60,000 matrix of 8-byte floats is about 28.8 GB before anything is even written.
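If the full matrix does not fit in memory at once, one workaround (a sketch, under the assumption that you only need the matrix on disk, not all in RAM) is to compute and write it in row blocks, holding only one slab at a time; 1,000 random points stand in for the real towers here:

```python
import csv
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(0)
lats = rng.uniform(25.0, 31.0, size=1000)   # stand-in for the 60,000 towers
lons = rng.uniform(-88.0, -80.0, size=1000)

block = 200  # rows held in memory at a time; tune to your RAM
with open("distance_matrix_blocks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for start in range(0, lats.size, block):
        stop = min(start + block, lats.size)
        # broadcasting: a (block, 1) column against an (n,) row gives a (block, n) slab
        slab = haversine(lats[start:stop, None], lons[start:stop, None], lats, lons)
        writer.writerows(slab)
```

Peak memory is then only block × n entries instead of n × n, at the cost of re-reading the CSV when you need the matrix again.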