filling in null values of location in a dataframe

data set in question

so, i have a dataframe with columns latitude and longitude that are float, along with that there are columns borough, on street name, cross street name, off street name which are a string containing a location. the null values are as follows

BOROUGH 598045
LATITUDE : 199705
LONGITUDE : 199705
ON STREET NAME : 394986
CROSS STREET NAME : 725867
OFF STREET NAME : 1656305

i want to fill in the null values of latitude and longitude. so i want to use GEOPANDAS to find the associated location's lat and long to fill in the missing values. i am having trouble coming up with the code.

essentially the code is supposed to

check the columns BOROUGH,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME if all are null drop them
then check if the latitude and/or latitude are null (also select the instances where lat and long are 0.0)
then check the columns BOROUGH, ON STREET, CROSS STREET and OFF STREET (in this order) get their location
and fill in the latitude and location

i have tried

location_fill = location _group.loc[(location_group['LATITUDE'].isna()) | (location_group['LATITUDE'] == 0.0)]
to_fill.dropna(subset=['BOROUGH', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'], how='all')

but i have no idea how to implement the rest of the code

Solution

You can install geocoder to get coordinates. Try something like:

import pandas as pd
import geocoder  # pip install geocoder

# Read data (only first 1000 rows)
df = pd.read_csv('NYPD_Motor_Vehicle_Collisions.csv', low_memory=False, nrows=1000)

# Clean names
cols = ['ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME', 'BOROUGH']
for col in cols:
    df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()

# Remove bad records
df = df.dropna(subset=cols, how='all')

# Find locations for these records
df1 = df[df['LATITUDE'].isna() | df['LATITUDE'].eq(0)]

# Determine what you want to keep to find coordinates
locs = df1[cols].bfill(axis=1).iloc[:, 0].rename('OSM')

# Find coordinates
coords = {}
for loc in locs.drop_duplicates():
    print(loc)
    g = geocoder.osm(f'{loc}, New York')
    if g:
        coords[loc] = {'LATITUDE': g.osm['y'],
                       'LONGITUDE': g.osm['x'],
                       'LOCATION': f"({g.osm['y']}, {g.osm['x']})"}

osm = pd.DataFrame.from_dict(coords, orient='index')

# Update missing values
out = (locs.reset_index().merge(osm, left_on='OSM', right_index=True)
           .set_index('index').drop(columns='OSM'))
df.loc[out.index, ['LATITUDE', 'LONGITUDE', 'LOCATION']] = out

Output:

>>> df
           DATE   TIME    BOROUGH  ZIP CODE   LATITUDE  LONGITUDE  ... UNIQUE KEY            VEHICLE TYPE CODE 1            VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3  VEHICLE TYPE CODE 4  VEHICLE TYPE CODE 5
0    08/04/2017   0:00     QUEENS   11436.0  40.666885 -73.790405  ...    3725017              PASSENGER VEHICLE              PASSENGER VEHICLE                 NaN                  NaN                  NaN
1    08/04/2017   0:00        NaN       NaN  40.719950 -74.008590  ...    3725047                  PICK-UP TRUCK  SPORT UTILITY / STATION WAGON                 NaN                  NaN                  NaN
2    08/04/2017   0:00        NaN       NaN  40.718666 -73.963500  ...    3725533  SPORT UTILITY / STATION WAGON              PASSENGER VEHICLE                 NaN                  NaN                  NaN
3    08/04/2017   0:00        NaN       NaN  40.754677 -73.975815  ...    3724870  SPORT UTILITY / STATION WAGON                           TAXI                 NaN                  NaN                  NaN
7    08/04/2017   0:00      BRONX   10460.0  40.836770 -73.885970  ...    3725174              PASSENGER VEHICLE                            NaN                 NaN                  NaN                  NaN
..          ...    ...        ...       ...        ...        ...  ...        ...                            ...                            ...                 ...                  ...                  ...
995  08/03/2017  20:15        NaN       NaN  40.741024 -73.978620  ...    3724516                           TAXI                     MOTORCYCLE                 NaN                  NaN                  NaN
996  08/03/2017  20:15        NaN       NaN  40.774315 -73.961611  ...    3724248  SPORT UTILITY / STATION WAGON              PASSENGER VEHICLE                 NaN                  NaN                  NaN
997  08/03/2017  20:20  MANHATTAN   10029.0  40.798622 -73.941630  ...    3724948              PASSENGER VEHICLE                            NaN                 NaN                  NaN                  NaN
998  08/03/2017  20:20     QUEENS   11379.0  40.725750 -73.877560  ...    3724699  SPORT UTILITY / STATION WAGON  SPORT UTILITY / STATION WAGON                 NaN                  NaN                  NaN
999  08/03/2017  20:30   BROOKLYN   11201.0  40.702534 -74.013985  ...    3724240              PASSENGER VEHICLE              PASSENGER VEHICLE                 NaN                  NaN                  NaN

[888 rows x 29 columns]

Before update:

>>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
70

After update:

>>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
9