r  dataframe  na  missing-data

Filling NA values with the last non-NA value only when they lie between identical non-NA values


I would like to replace the NA values in my dataset with the previous non-NA value, but only where the NAs lie between identical values.

To illustrate, here's a small sample of the data:

      date        1     2     3
1  2004-12-27     NA    NA    NA
2  2004-12-28  2.299 2.349 2.348
3  2004-12-29     NA    NA    NA
4  2005-01-03     NA    NA    NA
5  2005-01-04     NA    NA    NA
6  2005-01-05  2.299    NA    NA
7  2005-01-06     NA    NA    NA
8  2005-01-10     NA    NA    NA
9  2005-01-11  2.299 2.349 2.348
10 2005-01-12     NA    NA    NA
11 2005-01-17     NA    NA    NA
12 2005-01-18  2.299    NA    NA
13 2005-01-19     NA    NA    NA
14 2005-01-24     NA    NA    NA
15 2005-01-25     NA 2.369 2.368
16 2005-01-26  2.299    NA    NA
17 2005-01-31  2.299    NA    NA
18 2005-02-01     NA    NA    NA
19 2005-02-02     NA    NA    NA
20 2005-02-08     NA    NA    NA

The ideal output would be:

      date        1     2     3
1  2004-12-27     NA    NA    NA
2  2004-12-28  2.299 2.349 2.348
3  2004-12-29  2.299 2.349 2.348
4  2005-01-03  2.299 2.349 2.348
5  2005-01-04  2.299 2.349 2.348
6  2005-01-05  2.299 2.349 2.348
7  2005-01-06  2.299 2.349 2.348
8  2005-01-10  2.299 2.349 2.348
9  2005-01-11  2.299 2.349 2.348
10 2005-01-12  2.299    NA    NA
11 2005-01-17  2.299    NA    NA
12 2005-01-18  2.299    NA    NA
13 2005-01-19  2.299    NA    NA
14 2005-01-24  2.299    NA    NA
15 2005-01-25  2.299 2.369 2.368
16 2005-01-26  2.299    NA    NA
17 2005-01-31  2.299    NA    NA

Here's a reproducible sample of the dataset using dput:

structure(list(data_gas = structure(c(12779, 12780, 12781, 12786, 
12787, 12788, 12789, 12793, 12794, 12795, 12800, 12801, 12802, 
12807, 12808, 12809, 12814, 12815, 12816, 12822), class = "Date"), 
    `1` = c(NA, 2.299, NA, NA, NA, 2.299, NA, NA, 2.299, NA, 
    NA, 2.299, NA, NA, NA, 2.299, 2.299, NA, NA, NA), `3` = c(NA, 
    2.349, NA, NA, NA, NA, NA, NA, 2.349, NA, NA, NA, NA, NA, 
    2.369, NA, NA, NA, NA, NA), `4` = c(NA, 2.348, NA, NA, NA, 
    NA, NA, NA, 2.348, NA, NA, NA, NA, NA, 2.368, NA, NA, NA, 
    NA, NA)), row.names = c(NA, 20L), class = "data.frame")

I've tried a few for loops without success.

Any help will be greatly appreciated.


Solution

  • Here is a base R for loop solution.

    Write a function that walks through each pair of consecutive non-NA values and, if the two values are the same, fills the NA values between them with that value.

    fill_NA_values <- function(x) {
      # Indices of the non-NA values
      non_na_values <- which(!is.na(x))
      # Loop over each pair of consecutive non-NA indices
      for (i in seq_along(non_na_values[-1])) {
        start <- non_na_values[i]
        end <- non_na_values[i + 1]
        # Only act if there is a gap and the two non-NA values are the same.
        # Without the gap check, (start + 1):(end - 1) would count backwards
        # for adjacent values.
        if (end - start > 1 && x[start] == x[end]) {
          # Fill the NA values in between with that value
          x[(start + 1):(end - 1)] <- x[start]
        }
      }
      x
    }
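To make the pairing step concrete, here is a small sketch on a toy vector. The vector and the names `idx` and `pairs` are made up for illustration and are not part of the answer's function. It lists each pair of consecutive non-NA positions and whether the two values match; only a matching pair has the gap between it filled.

```r
# Toy input (hypothetical): two 5s with a gap between them, then a 7
x <- c(NA, 5, NA, NA, 5, NA, 7, NA)

# Positions of the non-NA values
idx <- which(!is.na(x))

# Each row is one pair of consecutive non-NA positions
pairs <- data.frame(from = head(idx, -1), to = tail(idx, -1))

# TRUE where the values at both ends of the pair are identical
pairs$same <- x[pairs$from] == x[pairs$to]

print(pairs)
```

Here the pair at positions 2 and 5 matches, so positions 3 and 4 would be filled; the pair at positions 5 and 7 does not match, so position 6 stays NA.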
    

    Apply it to every column except the first (the date) using lapply. Here df is the data frame from the question.

    df[-1] <- lapply(df[-1], fill_NA_values)
    df
    
    #         date    X1    X3    X4
    #1  2004-12-27    NA    NA    NA
    #2  2004-12-28 2.299 2.349 2.348
    #3  2004-12-29 2.299 2.349 2.348
    #4  2005-01-03 2.299 2.349 2.348
    #5  2005-01-04 2.299 2.349 2.348
    #6  2005-01-05 2.299 2.349 2.348
    #7  2005-01-06 2.299 2.349 2.348
    #8  2005-01-10 2.299 2.349 2.348
    #9  2005-01-11 2.299 2.349 2.348
    #10 2005-01-12 2.299    NA    NA
    #11 2005-01-17 2.299    NA    NA
    #12 2005-01-18 2.299    NA    NA
    #13 2005-01-19 2.299    NA    NA
    #14 2005-01-24 2.299    NA    NA
    #15 2005-01-25 2.299 2.369 2.368
    #16 2005-01-26 2.299    NA    NA
    #17 2005-01-31 2.299    NA    NA
    #18 2005-02-01    NA    NA    NA
    #19 2005-02-02    NA    NA    NA
    #20 2005-02-08    NA    NA    NA
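
As a possible loop-free variant, the same rule can be sketched with a forward fill and a backward fill: an NA is filled only when the nearest non-NA value before it equals the nearest non-NA value after it. The names `ffill` and `fill_between_equal` below are hypothetical helpers, not part of the answer above, and the sketch assumes exact equality (`==`) is acceptable for these values, as it is in the loop version.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA)
ffill <- function(x) {
  # Position of the most recent non-NA value at each index, 0 if none seen yet
  idx <- cummax(seq_along(x) * (!is.na(x)))
  c(NA, x)[idx + 1]
}

# Hypothetical alternative to fill_NA_values: fill an NA only when the nearest
# non-NA value before it equals the nearest non-NA value after it
fill_between_equal <- function(x) {
  fwd <- ffill(x)            # nearest non-NA value at or before each position
  bwd <- rev(ffill(rev(x)))  # nearest non-NA value at or after each position
  same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[same] <- fwd[same]
  x
}

# Toy check: only the gap between the two 1s is filled
toy <- fill_between_equal(c(NA, 1, NA, NA, 1, NA, 2, NA))
print(toy)  # NA 1 1 1 1 NA 2 NA

# The sample from the question's dput, assigned to df
df <- structure(list(data_gas = structure(c(12779, 12780, 12781, 12786,
12787, 12788, 12789, 12793, 12794, 12795, 12800, 12801, 12802,
12807, 12808, 12809, 12814, 12815, 12816, 12822), class = "Date"),
    `1` = c(NA, 2.299, NA, NA, NA, 2.299, NA, NA, 2.299, NA,
    NA, 2.299, NA, NA, NA, 2.299, 2.299, NA, NA, NA), `3` = c(NA,
    2.349, NA, NA, NA, NA, NA, NA, 2.349, NA, NA, NA, NA, NA,
    2.369, NA, NA, NA, NA, NA), `4` = c(NA, 2.348, NA, NA, NA,
    NA, NA, NA, 2.348, NA, NA, NA, NA, NA, 2.368, NA, NA, NA,
    NA, NA)), row.names = c(NA, 20L), class = "data.frame")

# Apply to every column except the date
df[-1] <- lapply(df[-1], fill_between_equal)
print(df)
```

Both versions leave leading and trailing NAs untouched, since those have no non-NA value on one side.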