so, i have a dataframe with columns latitude and longitude that are float, along with that there are columns borough, on street name, cross street name, off street name which are a string containing a location. the null values are as follows
i want to fill in the null values of latitude and longitude. so i want to use GEOPANDAS to find the associated location's lat and long to fill in the missing values. i am having trouble coming up with the code.
essentially the code is supposed to
i have tried
location_fill = location _group.loc[(location_group['LATITUDE'].isna()) | (location_group['LATITUDE'] == 0.0)]
to_fill.dropna(subset=['BOROUGH', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'], how='all')
but i have no idea how to implement the rest of the code
You can install geocoder
to get coordinates. Try something like:
import pandas as pd
import geocoder # pip install geocoder
# Read data (only first 1000 rows)
df = pd.read_csv('NYPD_Motor_Vehicle_Collisions.csv', low_memory=False, nrows=1000)
# Clean names
cols = ['ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME', 'BOROUGH']
for col in cols:
df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()
# Remove bad records
df = df.dropna(subset=cols, how='all')
# Find locations for these records
df1 = df[df['LATITUDE'].isna() | df['LATITUDE'].eq(0)]
# Determine what you want to keep to find coordinates
locs = df1[cols].bfill(axis=1).iloc[:, 0].rename('OSM')
# Find coordinates
coords = {}
for loc in locs.drop_duplicates():
print(loc)
g = geocoder.osm(f'{loc}, New York')
if g:
coords[loc] = {'LATITUDE': g.osm['y'],
'LONGITUDE': g.osm['x'],
'LOCATION': f"({g.osm['y']}, {g.osm['x']})"}
osm = pd.DataFrame.from_dict(coords, orient='index')
# Update missing values
out = (locs.reset_index().merge(osm, left_on='OSM', right_index=True)
.set_index('index').drop(columns='OSM'))
df.loc[out.index, ['LATITUDE', 'LONGITUDE', 'LOCATION']] = out
Output:
>>> df
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE ... UNIQUE KEY VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/04/2017 0:00 QUEENS 11436.0 40.666885 -73.790405 ... 3725017 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN
1 08/04/2017 0:00 NaN NaN 40.719950 -74.008590 ... 3725047 PICK-UP TRUCK SPORT UTILITY / STATION WAGON NaN NaN NaN
2 08/04/2017 0:00 NaN NaN 40.718666 -73.963500 ... 3725533 SPORT UTILITY / STATION WAGON PASSENGER VEHICLE NaN NaN NaN
3 08/04/2017 0:00 NaN NaN 40.754677 -73.975815 ... 3724870 SPORT UTILITY / STATION WAGON TAXI NaN NaN NaN
7 08/04/2017 0:00 BRONX 10460.0 40.836770 -73.885970 ... 3725174 PASSENGER VEHICLE NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
995 08/03/2017 20:15 NaN NaN 40.741024 -73.978620 ... 3724516 TAXI MOTORCYCLE NaN NaN NaN
996 08/03/2017 20:15 NaN NaN 40.774315 -73.961611 ... 3724248 SPORT UTILITY / STATION WAGON PASSENGER VEHICLE NaN NaN NaN
997 08/03/2017 20:20 MANHATTAN 10029.0 40.798622 -73.941630 ... 3724948 PASSENGER VEHICLE NaN NaN NaN NaN
998 08/03/2017 20:20 QUEENS 11379.0 40.725750 -73.877560 ... 3724699 SPORT UTILITY / STATION WAGON SPORT UTILITY / STATION WAGON NaN NaN NaN
999 08/03/2017 20:30 BROOKLYN 11201.0 40.702534 -74.013985 ... 3724240 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN
[888 rows x 29 columns]
Before update:
>>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
70
After update:
>>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
9