I am working on a correlative study in Python that needs a matrix of distances between each pair of coordinates in a data set of 60,000 points. I have attempted vectorization and geopandas, but the problem with geopandas is that its distance function needs the x and y data in repeating lists (the x list repeating the full set of 60,000 towers, and the y list repeating each coordinate 60,000 times in a row), making each list 3.6e9 values long. My computer runs out of memory before this completes, and when I try to run it on a remote desktop at my school it takes more than half an hour; I haven't been able to run it successfully. Here is the code I am running:
#Florida Tower Matrix
#take coordinates of Florida towers
#CHECK THE LAT/LONG order
import geojson
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

features = []
with open("/Users/katcn/Desktop/Spring 2024/Research/PleaseWork.geojson") as f:
    gj = geojson.load(f)
for i in range(59629):
    features.append(gj['features'][i]["geometry"]['coordinates'])

#OR make the X matrix all in one column
#make the Y matrix repeat each value 59000 times
longitude = []
latitude = []
for i in range(len(features)):
    for j in range(len(features)):
        longitude.append(features[j][0])
    for k in range(len(features)):
        latitude.append(features[i][0])

dict = {"longitude" : longitude, "latitude" : latitude}
df = pd.DataFrame(dict)
dict2 = {"longitude" : longitude, "latitude" : latitude}
df2 = pd.DataFrame(dict2)

#calculate distance between two towers
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df, crs={'init': 'epsg:4326'}, geometry=geometry)
geometry2 = [Point(xy) for xy in zip(df2.longitude, df2.latitude)]
gdf2 = gpd.GeoDataFrame(df2, crs={'init': 'epsg:4326'}, geometry=geometry2)

distances = gdf.geometry.distance(gdf2.geometry)
print(distances)
Any suggestions for approaching this problem differently to get a more reasonable run-time would be great.
This answer addresses the issue that immediately follows, namely saving the results you obtain to a CSV file. Given the sheer size of the data, pd.to_csv() is not the way to go.
Here, I build on the previous solution by computing the distance matrix. Note that the matrix is symmetric, since the distance between x and y is the same as the distance between y and x (so only the triangular part actually needs to be stored). I create the dataframe by indexing on the points (lat, long). Then, using Polars, I transform the pandas DataFrame into a Polars DataFrame and finally save it:
import numpy as np
import pandas as pd
import time
def haversine(lat1, lon1, lat2, lon2):
    # convert decimal degrees to radians, then apply the haversine formula
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean Earth radius in km
    return c * r
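As a quick sanity check (not in the original answer, and repeating the haversine definition above so it runs on its own): a point should be at distance 0 from itself, and one degree of latitude at constant longitude should come out to πR/180 ≈ 111.19 km with R = 6371 km.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# a point is at distance 0 from itself
print(haversine(28.0, -81.0, 28.0, -81.0))  # 0.0
# one degree of latitude is pi * 6371 / 180 ~ 111.19 km
print(haversine(28.0, -81.0, 29.0, -81.0))
```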
np.random.seed(42)
latitudes = np.random.uniform(low=25.0, high=49.0, size=10000)
longitudes = np.random.uniform(low=-125.0, high=-66.0, size=10000)
coords = np.column_stack((latitudes, longitudes))
start_time = time.time()
n = coords.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    dist_matrix[i, :] = haversine(coords[i, 0], coords[i, 1], coords[:, 0], coords[:, 1])
end_time = time.time()
elapsed_time = end_time - start_time
print("Distance matrix computed in {:.2f} seconds".format(elapsed_time))
print("Shape of the distance matrix:", dist_matrix.shape)
point_labels = [f"Point {i}" for i in range(n)]
dist_df = pd.DataFrame(dist_matrix, index=point_labels, columns=point_labels)
distance = dist_df.loc['Point 0', 'Point 1']
print(f"Distance from Point 0 to Point 1: {distance} km")
lat_lon_index = pd.MultiIndex.from_arrays([latitudes, longitudes], names=['Latitude', 'Longitude'])
dist_df.index = lat_lon_index
dist_df.columns = lat_lon_index
lat_from, lon_from = latitudes[0], longitudes[0]
lat_to, lon_to = latitudes[1], longitudes[1]
distance = dist_df.loc[(lat_from, lon_from), (lat_to, lon_to)]
print(f"Distance from ({lat_from}, {lon_from}) to ({lat_to}, {lon_to}): {distance} km")
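The symmetry noted above can also be exploited to roughly halve the arithmetic: compute only the entries to the right of the diagonal and mirror them. A small self-contained sketch (not part of the original answer; 500 random points stand in for the real data):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(42)
lats = rng.uniform(25.0, 49.0, size=500)
lons = rng.uniform(-125.0, -66.0, size=500)

n = lats.size
dist = np.zeros((n, n))
for i in range(n):
    # fill only the entries to the right of the diagonal
    dist[i, i+1:] = haversine(lats[i], lons[i], lats[i+1:], lons[i+1:])
dist += dist.T  # mirror the upper triangle; the diagonal stays 0

# spot-check against a direct evaluation
assert np.isclose(dist[3, 7], haversine(lats[3], lons[3], lats[7], lons[7]))
```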
Now dist_df is a 10000 × 10000 matrix whose rows and columns are both indexed by (Latitude, Longitude) pairs (screenshot omitted here). To save it to a CSV, just do this:
import polars as pl

dist_df_reset = dist_df.reset_index()
polars_df = pl.from_pandas(dist_df_reset)
polars_df.write_csv('polars_distance_matrix.csv')
The written file is the same matrix with the (Latitude, Longitude) index restored as ordinary columns (screenshot omitted here).

IMPORTANT NOTE: This will be much faster, but you cannot hope for it to take very little time with that many points: a full 60,000 × 60,000 matrix of 8-byte floats is about 28.8 GB before anything is even written.
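If the full matrix does not fit in memory at once, one workaround (a sketch, under the assumption that you only need the matrix on disk, not all in RAM) is to compute and write it in row blocks, holding only one slab at a time; 1,000 random points stand in for the real towers here:

```python
import csv
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1)/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(0)
lats = rng.uniform(25.0, 31.0, size=1000)   # stand-in for the 60,000 towers
lons = rng.uniform(-88.0, -80.0, size=1000)

block = 200  # rows held in memory at a time; tune to your RAM
with open("distance_matrix_blocks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for start in range(0, lats.size, block):
        stop = min(start + block, lats.size)
        # broadcasting: a (block, 1) column against an (n,) row gives a (block, n) slab
        slab = haversine(lats[start:stop, None], lons[start:stop, None], lats, lons)
        writer.writerows(slab)
```

Peak memory is then only block × n entries instead of n × n, at the cost of re-reading the CSV when you need the matrix again.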