Search code examples
rdplyrtidyrzoolocf

Fill empty cells between two values in column with last non empty cell and next non empty cell in R


I need to loop over IDs in a dataframe to fill NA values in a column by attributing empty cells evenly between the last and first filled entry outside of the NA cells.

ID     Value    X     Y
1          A      x     y 
1         NA      x     y 
1         NA      x     y 
1         NA      x     y 
1         NA      x     y 
1         NA      x     y 
1          B      x     y
2          C      x     y
2         NA      x     y
2         NA      x     y
2         NA      x     y
2         NA      x     y
2          D      x     y

Which should be filled to this:

ID     Value    X     Y
1          A      x     y 
1          A      x     y 
1          A      x     y 
1          B      x     y 
1          B      x     y 
1          B      x     y 
1          B      x     y
2          C      x     y
2          C      x     y
2          C      x     y
2          D      x     y
2          D      x     y
2          D      x     y

In case of 2n NA values between observations, n is attributed to the last and n to the next. In case of 2n+1 values, n is attributed to the last and n+1 to the next.

I know I need to use na.locf from the zoo package which works well with a large database for filling in empty values based on the last non-empty cell, along with the fromLast argument to perform "next observation carried backwards". I cannot however structure a loop to account for an even or odd number of NA values, and use both of these together.

Using the tidyverse package,

> library(tidyr)
> library(dplyr)
> df %>% dplyr::group_by(test$id) %>% fill(Value, .direction ="downup") %>% dplyr::ungroup()

This fills in NA values in both directions but does not account for different border values for NA cells in a group.


Solution

  • Define interp which replaces each successive non-NA with successive integers, applies na.appro9x, rounds and replaces the resulting integers with the original values.

    library(zoo)
    
    interp <- function(x) {
       x0 <- ifelse(is.na(x), NA, cumsum(!is.na(x)))
       xx <- na.approx(x0, rule = 2)
       na.omit(x)[round(xx)]
    }
    transform(DF, Value = interp(Value))
    

    giving:

       ID Value X Y
    1   1     A x y
    2   1     A x y
    3   1     A x y
    4   1     B x y
    5   1     B x y
    6   1     B x y
    7   1     B x y
    8   2     C x y
    9   2     C x y
    10  2     C x y
    11  2     D x y
    12  2     D x y
    13  2     D x y
    

    Note

    It is assumed that the input is the following, shown in reproducible form.

    Lines <- "ID     Value    X     Y
    1          A      x     y 
    1         NA      x     y 
    1         NA      x     y 
    1         NA      x     y 
    1         NA      x     y 
    1         NA      x     y 
    1          B      x     y
    2          C      x     y
    2         NA      x     y
    2         NA      x     y
    2         NA      x     y
    2         NA      x     y
    2          D      x     y"
    DF <- read.table(text = Lines, header = TRUE)