Search code examples
pandasdataframegeopandas

filling in null values of location in a dataframe


data set in question

so, i have a dataframe with columns latitude and longitude that are float, along with that there are columns borough, on street name, cross street name, off street name which are a string containing a location. the null values are as follows

  • BOROUGH 598045
  • LATITUDE : 199705
  • LONGITUDE : 199705
  • ON STREET NAME : 394986
  • CROSS STREET NAME : 725867
  • OFF STREET NAME : 1656305

i want to fill in the null values of latitude and longitude. so i want to use GEOPANDAS to find the associated location's lat and long to fill in the missing values. i am having trouble coming up with the code.

essentially the code is supposed to

  1. check the columns BOROUGH,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME if all are null drop them
  2. then check if the latitude and/or latitude are null (also select the instances where lat and long are 0.0)
  3. then check the columns BOROUGH, ON STREET, CROSS STREET and OFF STREET (in this order) get their location
  4. and fill in the latitude and location

i have tried

location_fill = location _group.loc[(location_group['LATITUDE'].isna()) | (location_group['LATITUDE'] == 0.0)]
to_fill.dropna(subset=['BOROUGH', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'], how='all')

but i have no idea how to implement the rest of the code


Solution

  • You can install geocoder to get coordinates. Try something like:

    import pandas as pd
    import geocoder  # pip install geocoder
    
    # Read data (only first 1000 rows)
    df = pd.read_csv('NYPD_Motor_Vehicle_Collisions.csv', low_memory=False, nrows=1000)
    
    # Clean names
    cols = ['ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME', 'BOROUGH']
    for col in cols:
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    # Remove bad records
    df = df.dropna(subset=cols, how='all')
    
    # Find locations for these records
    df1 = df[df['LATITUDE'].isna() | df['LATITUDE'].eq(0)]
    
    # Determine what you want to keep to find coordinates
    locs = df1[cols].bfill(axis=1).iloc[:, 0].rename('OSM')
    
    # Find coordinates
    coords = {}
    for loc in locs.drop_duplicates():
        print(loc)
        g = geocoder.osm(f'{loc}, New York')
        if g:
            coords[loc] = {'LATITUDE': g.osm['y'],
                           'LONGITUDE': g.osm['x'],
                           'LOCATION': f"({g.osm['y']}, {g.osm['x']})"}
    
    osm = pd.DataFrame.from_dict(coords, orient='index')
    
    # Update missing values
    out = (locs.reset_index().merge(osm, left_on='OSM', right_index=True)
               .set_index('index').drop(columns='OSM'))
    df.loc[out.index, ['LATITUDE', 'LONGITUDE', 'LOCATION']] = out
    

    Output:

    >>> df
               DATE   TIME    BOROUGH  ZIP CODE   LATITUDE  LONGITUDE  ... UNIQUE KEY            VEHICLE TYPE CODE 1            VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3  VEHICLE TYPE CODE 4  VEHICLE TYPE CODE 5
    0    08/04/2017   0:00     QUEENS   11436.0  40.666885 -73.790405  ...    3725017              PASSENGER VEHICLE              PASSENGER VEHICLE                 NaN                  NaN                  NaN
    1    08/04/2017   0:00        NaN       NaN  40.719950 -74.008590  ...    3725047                  PICK-UP TRUCK  SPORT UTILITY / STATION WAGON                 NaN                  NaN                  NaN
    2    08/04/2017   0:00        NaN       NaN  40.718666 -73.963500  ...    3725533  SPORT UTILITY / STATION WAGON              PASSENGER VEHICLE                 NaN                  NaN                  NaN
    3    08/04/2017   0:00        NaN       NaN  40.754677 -73.975815  ...    3724870  SPORT UTILITY / STATION WAGON                           TAXI                 NaN                  NaN                  NaN
    7    08/04/2017   0:00      BRONX   10460.0  40.836770 -73.885970  ...    3725174              PASSENGER VEHICLE                            NaN                 NaN                  NaN                  NaN
    ..          ...    ...        ...       ...        ...        ...  ...        ...                            ...                            ...                 ...                  ...                  ...
    995  08/03/2017  20:15        NaN       NaN  40.741024 -73.978620  ...    3724516                           TAXI                     MOTORCYCLE                 NaN                  NaN                  NaN
    996  08/03/2017  20:15        NaN       NaN  40.774315 -73.961611  ...    3724248  SPORT UTILITY / STATION WAGON              PASSENGER VEHICLE                 NaN                  NaN                  NaN
    997  08/03/2017  20:20  MANHATTAN   10029.0  40.798622 -73.941630  ...    3724948              PASSENGER VEHICLE                            NaN                 NaN                  NaN                  NaN
    998  08/03/2017  20:20     QUEENS   11379.0  40.725750 -73.877560  ...    3724699  SPORT UTILITY / STATION WAGON  SPORT UTILITY / STATION WAGON                 NaN                  NaN                  NaN
    999  08/03/2017  20:30   BROOKLYN   11201.0  40.702534 -74.013985  ...    3724240              PASSENGER VEHICLE              PASSENGER VEHICLE                 NaN                  NaN                  NaN
    
    [888 rows x 29 columns]
    

    Before update:

    >>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
    70
    

    After update:

    >>> (df['LATITUDE'].isna() | df['LATITUDE'].eq(0)).sum()
    9