Search code examples
rdistancetidyversemissing-datageosphere

Imputation missing environmental data based on geospatial distance


I want to impute missing values of Temp at a Station by replacing the mean of the Temp at two closest Stations.

library(tidyverse)
library(lubridate)

tb1 <- 
  tibble::tibble(
    Date = as_date(rep(c("2019-01-01", "2019-01-02"), each = 4))
  , Stat = rep(c("F", "L", "M", "R"), times = 2)
  , Lat  = rep(c(31.418715, 31.582045, 30.181459, 33.626057), times = 2)
  , Long = rep(c(73.079109, 74.329376, 71.492157, 73.071442), times = 2)
  , Temp = c(NA, 20, 28, 25, 26, 25, NA, 24)
  )

tb1

# A tibble: 8 x 5
  Date       Stat    Lat  Long  Temp
  <date>     <chr> <dbl> <dbl> <dbl>
1 2019-01-01 F      31.4  73.1    NA
2 2019-01-01 L      31.6  74.3    20
3 2019-01-01 M      30.2  71.5    28
4 2019-01-01 R      33.6  73.1    25
5 2019-01-02 F      31.4  73.1    26
6 2019-01-02 L      31.6  74.3    25
7 2019-01-02 M      30.2  71.5    NA
8 2019-01-02 R      33.6  73.1    24

This replacing the missing values by the average of non-missing temp at all stations.

impute.mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
  }

tb1 %>% 
  group_by(Date) %>% 
  mutate(Temp1 = impute.mean(Temp))

# A tibble: 8 x 6
# Groups:   Date [2]
  Date       Stat    Lat  Long  Temp Temp1
  <date>     <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F      31.4  73.1    NA  24.3
2 2019-01-01 L      31.6  74.3    20  20  
3 2019-01-01 M      30.2  71.5    28  28  
4 2019-01-01 R      33.6  73.1    25  25  
5 2019-01-02 F      31.4  73.1    26  26  
6 2019-01-02 L      31.6  74.3    25  25  
7 2019-01-02 M      30.2  71.5    NA  25  
8 2019-01-02 R      33.6  73.1    24  24

Used this code the find the distance between two Stations

library(geosphere)

distm(
    x = c(73.079109, 31.418715)
  , y = c(74.329376, 31.582045)
  , fun = distHaversine
  )

         [,1]
[1,] 120053.3

Could not figure out how to calculate distance using tidyverse?

tb1 %>% 
  mutate(
    Dist = distm(
          x = c(Long, Lat)
        , y = c(Long, Lat)
        , fun = distHaversine
        )
    )

Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2


Solution

  • I add below a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.

    Stations in tb1 without an observation for Temp:

    tb1_na <- tb1 %>% filter(is.na(Temp)) 
    

    Create function to determine distances to stations for a certain date:

    circle_fn <- function(x, y, z){
      spatialrisk::points_in_circle(tb1 %>% filter(Date == z), 
                                    lon_center = x, 
                                    lat_center = y, 
                                    lon = Long, 
                                    lat = Lat, 
                                    radius = 1e6)
    }
    

    Since each element of the output is a data frame, purrr::map_dfr is used to row-bind them together:

    purrr::pmap_dfr(list(tb1_na$Long, tb1_na$Lat, tb1_na$Date), 
                    circle_fn, .id = "tb1_na") %>%
       group_by(tb1_na) %>%
       slice(2:3) %>%
       summarize(Temp = mean(Temp)) %>%
       ungroup() %>%
       bind_cols(tb1_na, .) %>%
       select(-tb1_na)
    

    Output:

      Date       Stat    Lat  Long  Temp Temp1
      <date>     <chr> <dbl> <dbl> <dbl> <dbl>
    1 2019-01-01 F      31.4  73.1    NA  24  
    2 2019-01-02 M      30.2  71.5    NA  25.5