Given a dataframe as follows:
latitude longitude user_service
0 -27.496404 153.014353 02: Duhig Tower
1 -27.497107 153.014836 NaN
2 -27.497118 153.014890 NaN
3 -27.497154 153.014813 NaN
4 -27.496437 153.014477 12: Duhig North
5 -27.497156 153.014813 32: Gordon Greenwod
6 -27.497097 153.014746 23: Abel Smith
7 -27.496390 153.014415 32: Gordon Greenwod
8 -27.497112 153.014780 03: Steele
9 -27.497156 153.014813 32: Gordon Greenwod
10 -27.496487 153.014622 02: Duhig Tower
11 -27.497075 153.014532 NaN
12 -27.497103 153.014817 25: UQ Sports
13 -27.496754 153.014504 02: Duhig Tower
14 -27.496567 153.014294 02: Duhig Tower
15 -27.497156 153.014813 32: Gordon Greenwod
Since user_service
column has missing values, so I think maybe I could use spatial clustering method to fill nans.
For example, for the latitude
and longitude
pair -27.497107, 153.014836
in the second row, if 02: Duhig Tower
's location is most closest one to it by distance, so I would like to fill nan in user_service
with 02: Duhig Tower
for this row. Same logic for the other missing rows.
How could I implement the logic above in Python? Thanks.
The output from Guillermo Mosse's solution, but there are still some NaN
s:
latitude longitude user_service
0 -27.499012 153.015180 51: Zelman Cowen
1 -27.497600 153.014479 03: Steele
2 -27.500054 153.013435 50: Hawken Engineering
3 -27.495979 153.009834 NaN
4 -27.496748 153.017507 32: Gordon Greenwod
5 -27.495695 153.016178 38: UQ Multi Faith Chaplaincy
6 -27.497015 153.012492 01: Forgan Smith
7 -27.498797 153.017267 NaN
8 -27.500508 153.011360 75: AIBN
9 -27.496763 153.013795 01: Forgan Smith
10 -27.494909 153.017187 NaN
11 -27.496384 153.013810 12: Duhig North
Check NaN
s:
var = df.loc[[2]].user_service
print(var)
print(type(var))
print(len(var))
Out:
2 NaN
Name: user_service, dtype: object
<class 'pandas.core.series.Series'>
1
Ideally, you would want to use Pandas' interpolate with a custom distance function to fill NaN values, but the method doesn't seem to be extendable in any way.
A possible solution is to, for each datapoint, get the service_name of the closest data point that actually has a service_name. Here's a full working example of a possible solution:
import pandas as pd
from scipy.spatial.distance import cdist
import numpy as np
df = pd.DataFrame ([
[-27.496404, 153.014353, "02: Duhig Tower"],
[-27.497107, 153.014836, None],
[-27.497118, 153.014890, None],
[-27.497154, 153.014813, None],
[-27.496437, 153.014477, "12: Duhig North"],
[-27.497156, 153.014813, "32: Gordon Greenwod"],
[-27.497097, 153.014746, "23: Abel Smith"],
[-27.496390, 153.014415, "32: Gordon Greenwod"],
[-27.497112, 153.014780, "03: Steele"],
[-27.497156, 153.014813, "32: Gordon Greenwod"],
[-27.496487, 153.014622, "02: Duhig Tower"],
[-27.497075, 153.014532, None],
[-27.497103, 153.014817, "25: UQ Sports"],
[-27.496754, 153.014504, "02: Duhig Tower"],
[-27.496567, 153.014294, "02: Duhig Tower"],
[-27.497156, 153.014813, "32: Gordon Greenwod"]],
columns = ["latitude", "longitude", "user_service"])
def closest_point_service_name(point, points, user_services):
""" Find closest point with non null user_service """
#First we filter the points and user_services by the ones that don't have null user_service
points = points[user_services != None]
user_services = user_services[user_services != None]
#we use cdist to get all distances between pairs of points
distances = cdist([point], points)[0]
#we don't want to consider the current point
distances[distances == 0] = np.inf
#we get the index of the closest point
closest_point_index = distances.argmin()
#we return the user_service of the closest point that has a user_service
closest_point_user_service = user_services[closest_point_index]
return closest_point_user_service
#we convert the lat and long to a pair
df['point'] = [(x, y) for x,y in zip(df['latitude'], df['longitude'])]
#we create the additional column
df['closest'] = [closest_point_service_name(x, np.asarray(list(df['point'])), np.asarray(list(df['user_service']))) for x in df['point']]
#finally, we fill nulls
df.user_service = df.user_service.fillna(df['closest'])
del df['closest']
df
This is the output:
latitude longitude user_service point
0 -27.496404 153.014353 02: Duhig Tower (-27.496404, 153.014353)
1 -27.497107 153.014836 25: UQ Sports (-27.497107, 153.014836)
2 -27.497118 153.014890 25: UQ Sports (-27.497118, 153.01489)
3 -27.497154 153.014813 32: Gordon Greenwod (-27.497154, 153.014813)
4 -27.496437 153.014477 12: Duhig North (-27.496437, 153.014477)
5 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)
6 -27.497097 153.014746 23: Abel Smith (-27.497097, 153.014746)
7 -27.496390 153.014415 32: Gordon Greenwod (-27.49639, 153.014415)
8 -27.497112 153.014780 03: Steele (-27.497112, 153.01478)
9 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)
10 -27.496487 153.014622 02: Duhig Tower (-27.496487, 153.014622)
11 -27.497075 153.014532 23: Abel Smith (-27.497075, 153.014532)
12 -27.497103 153.014817 25: UQ Sports (-27.497103, 153.014817)
13 -27.496754 153.014504 02: Duhig Tower (-27.496754, 153.014504)
14 -27.496567 153.014294 02: Duhig Tower (-27.496567, 153.014294)
15 -27.497156 153.014813 32: Gordon Greenwod (-27.497156, 153.014813)