Search code examples
rtime-seriesinterpolationpopulatelocf

Replace NA values with the first value in all directions in R


I am looking to populate missing values in my table with non-NA value of the closest date, be it before or after the reference date. This means that a table like:

   date         value
03.03.2023        1
04.03.2023       NA
06.03.2023        4
09.03.2023       NA 
10.03.2023        3

Would be filled as:

   date         value
03.03.2023        1
04.03.2023        1
06.03.2023        4
09.03.2023        3 
11.03.2023        3

Explanation: Since 03.03. is closer to 04.03., locf is used. Yet since 11.03. is closer to 09.03., nocb (locf, fromLast = T) is used.

Eventual conflicts could take place if a NA value is flanked by two values of equal distance to the reference date. In this case, I would like locf to be preferred.

The code I have at the moment uses the stiff "locf" twice (once as standard and once as fromLast) and is not as flexible:

read.csv("path/to/merged_data.csv",
         colClasses = c("Date", "numeric", "numeric", "numeric", "character")) %>%
  group_by(field_id) %>%
  arrange(date) %>%
  mutate(
    Nearest_l8_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8), NDVI_l8),
    Nearest_s2_locf = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2), NDVI_s2),
    Nearest_l8_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_l8), na.locf(NDVI_l8, fromLast = TRUE), NDVI_l8),
    Nearest_s2_locb = ifelse(!is.na(NDVI_l7) & is.na(NDVI_s2), na.locf(NDVI_s2, fromLast = TRUE), NDVI_s2)
  ) %>%
  filter(!is.na(NDVI_l7)) %>%
  select(-NDVI_l8, -NDVI_s2) %>%
  relocate(field_id, .after = last_col()) %>%
  write_csv(file.path(results, "merged_data_interpolated.csv"))

In my actual case, the reference date are all dates for which a column (NDVI_l7) is not NA and the procedure to populate NA is done for two other columns (NDVI_l8 and NDVI_s2). It is also grouped by the column "field_id" since dates are repeated for each of those ID.

How can I adapt the code so that NA values are populated with the values of the closest date, regardless of where it is in the column?


Solution

  • another variant with base R only:

    d being your example data:

    d <- structure(list(date = structure(c(19419, 19420, 19422, 19425, 
    19426), class = "Date"), value = c(1L, NA, 4L, NA, 3L)), row.names = c(NA, 
    5L), class = "data.frame")
    

    convert column date to class Date:

    d$date <- as.Date(d$date, '%d.%m.%Y')
    

    exploit the distance function to find closest neighbour:

    impute_from_neighbours <- function(values, dates){
      dists <- dist(dates) |> as.matrix()
      dists[dists == 0] <- NA
      na_pos <- which(is.na(values))
      closest_non_na_pos <- apply(dists[, na_pos], 2, which.min)
      values[na_pos] <- values[closest_non_na_pos]
      values
    }
    
    d$value <- impute_from_neighbours(d$value, d$date)
    

    output:

    > d
            date value
    1 2023-03-03     1
    2 2023-03-04     1
    3 2023-03-06     4
    4 2023-03-09     3
    5 2023-03-10     3