R: filling in missing text data when an approximate GPS coordinate is known (accurate to roughly street-corner level)

Working in R, I have a data set with 3,000,000 observations and 17 columns. start_at. lat. long. location1. x.11. y.11 location2 x.12. y.11 location3. x.11. y.12 location4. x.13. y.13 . x.10. y.12 location3. x.10 y.12

Two of the columns hold station (location) names, and 150,000 of those values are missing. Two other columns hold the GPS latitude and longitude, and these are filled in for every row. The values are the GPS coordinates of bicycle rental stations (750 of them). I want to fill in the missing names in the station column by matching the rows that lack a name against rows where both the name and the GPS coordinates are known.

Also, the GPS coordinates are not identical within a station. They are precise enough that each slot has a slightly different coordinate, so each station name is associated with a range of coordinates.

station_name  lat      long
x             91.1234  -87.4848
y             93.1245  -87.9876
z             92.8488  -86.8765
z             92.8478  -86.8800
x             91.1245  -87.5000
missing       91.1233  -87.4850

I need to know how to replace 'missing' with 'x', since the known 'x' observations are the closest to that lat/long.

I would do this by hand except there are 150,000 missing station names, out of 700 unique station names, in a df with 3,000,000 rows.

Is there a way to say 'if the lat/long is within +/- some range, find a station name whose GPS coordinates fall within that range'? Or:

How can I find the min and max coordinates for a known station name and insert that station name into any missing observation that falls between them? Or:

How can I normalize all the GPS coordinates so that only one coordinate pair matches each station name? I.e. take a unique station name, normalize all of its known coordinates to a single pair, and then use that pair to fill in the missing names for that station.

Solution

You can use the wonderful s2 package here:

library(s2)
library(dplyr)

df <- data.frame(station_name = c("a", "b", "c", "d", "e", "missing"), 
                 lat = c(81.1234, 83.1245, 82.8488, 82.8478, 81.1245, 81.1233), 
                 lon = c(-87.4848, -87.9876, -86.8765, -86.8800, -87.5000, -87.4850))

Create two data frames:


# data.frame with missings
df_missings <- df %>%
  filter(station_name == "missing")

# data.frame without missings
df_nomissings <- df %>%
  filter(station_name != "missing")

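s2_closest_feature() from the s2 package returns, for each point in the first set, the index of the nearest point in the second set. Because that index refers to the second set, it has to be looked up in df_nomissings, not in the original df: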

missings_s2 <- s2_lnglat(df_missings$lon, df_missings$lat)
nomissings_s2 <- s2_lnglat(df_nomissings$lon, df_nomissings$lat)
# The returned indices point into nomissings_s2, so index df_nomissings (not df)
df_missings$station_name <- df_nomissings$station_name[s2_closest_feature(missings_s2, nomissings_s2)]

Bind data.frames together:

bind_rows(df_missings, df_nomissings)
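In the real data the missing names are probably stored as NA or as empty strings, not as the literal "missing", and the original row order may matter. A minimal sketch of the same nearest-point fill that handles those cases in place follows. The function name fill_nearest_station and its missing_values argument are hypothetical, and it assumes the columns are called station_name, lat and lon as in the example above.

```r
library(s2)

# Same toy data as in the answer
df <- data.frame(station_name = c("a", "b", "c", "d", "e", "missing"),
                 lat = c(81.1234, 83.1245, 82.8488, 82.8478, 81.1245, 81.1233),
                 lon = c(-87.4848, -87.9876, -86.8765, -86.8800, -87.5000, -87.4850))

# Hypothetical helper: fill missing station names from the nearest known point,
# keeping the rows in their original order
fill_nearest_station <- function(data, missing_values = c("missing", "")) {
  # Treat NA and any of the placeholder strings as missing
  is_missing <- is.na(data$station_name) | data$station_name %in% missing_values
  known <- data[!is_missing, ]

  # Index into `known` of the closest known point for each missing row
  idx <- s2_closest_feature(
    s2_lnglat(data$lon[is_missing], data$lat[is_missing]),
    s2_lnglat(known$lon, known$lat)
  )

  data$station_name[is_missing] <- known$station_name[idx]
  data
}

filled <- fill_nearest_station(df)
```

Here the missing row is only a few metres from station "a", so it receives that name. No distance limit is applied in this sketch, so every missing row gets a name, however far away the nearest known point is.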

And if you would like to replace the station name only when the station is within a certain radius:

library(spatialrisk)

df_missings1 <- df %>%
  filter(station_name == "missing")

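# Index (into df_nomissings) of the closest known point for each missing row
closest_idx <- s2_closest_feature(missings_s2, nomissings_s2)
df_missings1$closest_station <- df_nomissings$station_name[closest_idx]

# Radius in meters
radius_meters <- 100

# Find distance to closest station. The coordinates are looked up by index
# rather than joined on station name, because a join would duplicate rows
# when one station name has several coordinates.
df_closest <- df_missings1 %>%
  mutate(closest_lat = df_nomissings$lat[closest_idx],
         closest_lon = df_nomissings$lon[closest_idx]) %>%
  mutate(distance = spatialrisk::haversine(lat, lon, closest_lat, closest_lon)) %>% # in meters
  mutate(station_name = ifelse(distance < radius_meters, closest_station, station_name))

df_closest %>%
  select(station_name, lat, lon) %>%
  bind_rows(df_nomissings)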
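The third idea in the question (reduce each station to a single coordinate pair, then match against those) can be sketched as below. This is only an illustration under stated assumptions: the toy data reuses the coordinates above with repeated station names, missing names are stored as NA, the per-station coordinate is a plain mean of lat and lon, and max_dist_m is a hypothetical threshold. Distances come from s2_distance(), which returns meters by default, so spatialrisk is not needed here.

```r
library(s2)
library(dplyr)

# Toy data: station names repeat because each slot reports slightly different coordinates
df2 <- data.frame(
  station_name = c("x", "y", "z", "z", "x", NA),
  lat = c(81.1234, 83.1245, 82.8488, 82.8478, 81.1245, 81.1233),
  lon = c(-87.4848, -87.9876, -86.8765, -86.8800, -87.5000, -87.4850)
)

# One representative coordinate per station (plain mean of the known points)
stations <- df2 %>%
  filter(!is.na(station_name)) %>%
  group_by(station_name) %>%
  summarise(lat = mean(lat), lon = mean(lon), .groups = "drop")

is_missing <- is.na(df2$station_name)
missing_pts <- s2_lnglat(df2$lon[is_missing], df2$lat[is_missing])
station_pts <- s2_lnglat(stations$lon, stations$lat)

# Nearest station centre for each missing row, and the distance to it in meters
idx <- s2_closest_feature(missing_pts, station_pts)
dist_m <- s2_distance(missing_pts, s2_lnglat(stations$lon[idx], stations$lat[idx]))

# Hypothetical threshold: only accept a match within this many meters
max_dist_m <- 200
df2$station_name[is_missing] <- ifelse(dist_m <= max_dist_m,
                                       stations$station_name[idx],
                                       NA_character_)
```

With 3,000,000 rows this searches against a few hundred station centres instead of millions of known points. The trade-off is that a station's mean coordinate can be farther from a given slot than the nearest known slot is, so the threshold needs to allow for the spread of coordinates within a station.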