Search code examples
python-3.xpandasscikit-learnk-meansdbscan

Fill missing values based on spatial clustering method in Python


Given a dataframe as follows:

     latitude   longitude         user_service
0  -27.496404  153.014353      02: Duhig Tower
1  -27.497107  153.014836                  NaN
2  -27.497118  153.014890                  NaN
3  -27.497154  153.014813                  NaN
4  -27.496437  153.014477      12: Duhig North
5  -27.497156  153.014813  32: Gordon Greenwod
6  -27.497097  153.014746       23: Abel Smith
7  -27.496390  153.014415  32: Gordon Greenwod
8  -27.497112  153.014780           03: Steele
9  -27.497156  153.014813  32: Gordon Greenwod
10 -27.496487  153.014622      02: Duhig Tower
11 -27.497075  153.014532                  NaN
12 -27.497103  153.014817        25: UQ Sports
13 -27.496754  153.014504      02: Duhig Tower
14 -27.496567  153.014294      02: Duhig Tower
15 -27.497156  153.014813  32: Gordon Greenwod

Since user_service column has missing values, so I think maybe I could use spatial clustering method to fill nans.

For example, for the latitude and longitude pair -27.497107, 153.014836 in the second row, if 02: Duhig Tower's location is most closest one to it by distance, so I would like to fill nan in user_service with 02: Duhig Tower for this row. Same logic for the other missing rows.

How could I implement the logic above in Python? Thanks.

The output from Guillermo Mosse's solution, but there are still some NaNs:

     latitude   longitude                   user_service
0  -27.499012  153.015180               51: Zelman Cowen
1  -27.497600  153.014479                     03: Steele
2  -27.500054  153.013435         50: Hawken Engineering
3  -27.495979  153.009834                            NaN
4  -27.496748  153.017507            32: Gordon Greenwod
5  -27.495695  153.016178  38: UQ Multi Faith Chaplaincy
6  -27.497015  153.012492               01: Forgan Smith
7  -27.498797  153.017267                            NaN
8  -27.500508  153.011360                       75: AIBN
9  -27.496763  153.013795               01: Forgan Smith
10 -27.494909  153.017187                            NaN
11 -27.496384  153.013810                12: Duhig North

Check NaNs:

var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))

Out:

2    NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1

Solution

  • Ideally, you would want to use Pandas' interpolate with a custom distance function to fill NaN values, but the method doesn't seem to be extendable in any way.

    A possible solution is to, for each datapoint, get the service_name of the closest data point that actually has a service_name. Here's a full working example of a possible solution:

    import pandas as pd
    from scipy.spatial.distance import cdist
    import numpy as np
    
    df = pd.DataFrame    ([
      [-27.496404,  153.014353,      "02: Duhig Tower"],
      [-27.497107,  153.014836,                  None],
      [-27.497118,  153.014890,                  None],
      [-27.497154,  153.014813,                  None],
      [-27.496437,  153.014477,      "12: Duhig North"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"],
      [-27.497097,  153.014746,       "23: Abel Smith"],
      [-27.496390,  153.014415,  "32: Gordon Greenwod"],
      [-27.497112,  153.014780,           "03: Steele"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"],
      [-27.496487,  153.014622,      "02: Duhig Tower"],
      [-27.497075,  153.014532,                  None],
      [-27.497103,  153.014817,        "25: UQ Sports"],
      [-27.496754,  153.014504,      "02: Duhig Tower"],
      [-27.496567,  153.014294,      "02: Duhig Tower"],
      [-27.497156,  153.014813,  "32: Gordon Greenwod"]],
        columns = ["latitude", "longitude", "user_service"])
    
    
    def closest_point_service_name(point, points, user_services):
        """ Find closest point with non null user_service """
    
        #First we filter the points and user_services by the ones that don't have null user_service
        points = points[user_services != None]
        user_services = user_services[user_services != None]
    
        #we use cdist to get all distances between pairs of points
        distances = cdist([point], points)[0]
    
        #we don't want to consider the current point
        distances[distances == 0] = np.inf
    
        #we get the index of the closest point
        closest_point_index = distances.argmin()
    
        #we return the user_service of the closest point that has a user_service
        closest_point_user_service = user_services[closest_point_index]
        return closest_point_user_service
    
    #we convert the lat and long to a pair
    df['point'] = [(x, y) for x,y in zip(df['latitude'], df['longitude'])]
    
    #we create the additional column
    df['closest'] = [closest_point_service_name(x, np.asarray(list(df['point'])), np.asarray(list(df['user_service']))) for x in df['point']]
    
    #finally, we fill nulls
    df.user_service = df.user_service.fillna(df['closest'])
    
    del df['closest']
    
    df
    

    This is the output:

    latitude    longitude   user_service    point
    0   -27.496404  153.014353  02: Duhig Tower     (-27.496404, 153.014353)
    1   -27.497107  153.014836  25: UQ Sports   (-27.497107, 153.014836)
    2   -27.497118  153.014890  25: UQ Sports   (-27.497118, 153.01489)
    3   -27.497154  153.014813  32: Gordon Greenwod     (-27.497154, 153.014813)
    4   -27.496437  153.014477  12: Duhig North     (-27.496437, 153.014477)
    5   -27.497156  153.014813  32: Gordon Greenwod     (-27.497156, 153.014813)
    6   -27.497097  153.014746  23: Abel Smith  (-27.497097, 153.014746)
    7   -27.496390  153.014415  32: Gordon Greenwod     (-27.49639, 153.014415)
    8   -27.497112  153.014780  03: Steele  (-27.497112, 153.01478)
    9   -27.497156  153.014813  32: Gordon Greenwod     (-27.497156, 153.014813)
    10  -27.496487  153.014622  02: Duhig Tower     (-27.496487, 153.014622)
    11  -27.497075  153.014532  23: Abel Smith  (-27.497075, 153.014532)
    12  -27.497103  153.014817  25: UQ Sports   (-27.497103, 153.014817)
    13  -27.496754  153.014504  02: Duhig Tower     (-27.496754, 153.014504)
    14  -27.496567  153.014294  02: Duhig Tower     (-27.496567, 153.014294)
    15  -27.497156  153.014813  32: Gordon Greenwod     (-27.497156, 153.014813)