I have two dataframes.
df1 has 580 unique records - with latitude and longitude information
df2 has 490000 unique records - with latitude and longitude information
I am trying to find out how many of these 580 locations fall within a radius of 400 meters of any of the 490000 locations.
I am using the following code and it is working.
from __future__ import print_function
from config import conn
from pandas import DataFrame
import pandas as pd
import math

def distance(origin, destination):
    # Haversine distance between two (lat, lon) pairs, returned in meters
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 * 1000  # Earth radius in meters
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d

def convertTuple(tup):
    return ''.join(tup)
df1 = pd.read_csv("/home/ubuntu/maid80.csv")
df2 = pd.read_csv("/home/ubuntu/iodr.csv")

ll = []
for index, rows in df2.iterrows():
    lat1 = rows['latitude']
    lon1 = rows['longitude']
    for i, r in df1.iterrows():
        k = distance((lat1, lon1), (r['latitude'], r['longitude']))
        if k <= 400:
            ll.append(rows['id'])
            # print(ll)
    print(index)
myset = set(ll)
print(myset)
I am running this on my laptop and it is taking more than 2 hours to complete all 580 iterations. My worry is that the number of records in the second dataset is going to swell.
Is there a better way to do this so that I can save time?
You may try this using geopandas:
import geopandas as gpd
import pandas as pd
import pyproj

df1 = pd.read_csv("/home/ubuntu/maid80.csv")
df2 = pd.read_csv("/home/ubuntu/iodr.csv")

gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1['longitude'], df1['latitude']), crs=pyproj.CRS.from_epsg(4326))
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2['longitude'], df2['latitude']), crs=pyproj.CRS.from_epsg(4326))

radius = 400  # meters

# Reproject both datasets to a metric CRS so the buffer distance is in meters
for gdf in [gdf1, gdf2]:
    gdf.to_crs(pyproj.CRS.from_epsg(3857), inplace=True)

gdf1['geometry'] = gdf1['geometry'].buffer(radius)
gdf2['IS_WITHIN_400M'] = 1

# sjoin needs GeoDataFrames on both sides, so pass the relevant columns of gdf2
gdf = gpd.sjoin(gdf1, gdf2[['geometry', 'IS_WITHIN_400M']], how='left')
print(gdf[gdf.IS_WITHIN_400M == 1].head())
Some explanations:
Geopandas lets you work with a GeoDataFrame, on which you can "buffer" your points by a radius (really fast). The points_from_xy function is also pretty fast and lets you construct those point geometries efficiently.
The sjoin method (short for spatial join) is fast as well. I suspect this has something to do with an algorithm based on bounding boxes and/or a spatial index over the coordinates... I've had good results with this method.
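If the goal is just to count how many of the 580 buffered locations contain at least one of the 490000 points, a minimal sketch building on the join above (assuming gdf1 keeps its default integer index through the left join) could be:

# Rows of gdf1 that matched at least one point from gdf2
matched = gdf[gdf.IS_WITHIN_400M == 1]
print(matched.index.nunique(), "of", len(gdf1), "locations have a point within 400 m")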
Warning:
I projected the datasets into EPSG 3857, which is global AND has cartesian coordinates (in meters). For any "real" project, you have to choose the projection carefully (i.e. pick the most "metric-system-friendly" projection for your area) to avoid any distortion of the buffer...
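If your geopandas version is recent enough (0.9 or later), one option is to let geopandas estimate a local UTM zone from the data and use that instead of EPSG 3857. A sketch, applied to the WGS84 GeoDataFrames before the reprojection and buffer steps above:

# Assumes gdf1/gdf2 are still in EPSG 4326 at this point
utm_crs = gdf1.estimate_utm_crs()  # picks a UTM CRS from the layer's extent (geopandas >= 0.9)
gdf1 = gdf1.to_crs(utm_crs)
gdf2 = gdf2.to_crs(utm_crs)
# buffer(400) now means 400 meters, with far less distortion than EPSG 3857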