Search code examples
rdataframedplyrtidyr

fill depending on the number of consecutive NAs


Here is the example dataframe:

df1 <- read.table(text = "DT   odczyt.1
'2023-08-16 06:57:15'   482.307
'2023-08-16 06:57:30'   482.307
'2023-08-16 06:57:45'   482.311
'2023-08-16 06:58:00'   NA
'2023-08-16 06:58:15'   NA
'2023-08-16 06:58:30'   482.315
'2023-08-16 06:58:45'   482.315
'2023-08-16 06:59:00'   482.318
'2023-08-16 06:59:15'   482.324
'2023-08-16 06:59:30'   NA
'2023-08-16 06:59:45'   NA
'2023-08-16 07:00:00'   NA
'2023-08-16 07:00:15'   NA
'2023-08-16 07:00:30'   NA
'2023-08-16 07:00:45'   NA", header = TRUE)

I want to complete the data, but only if NA occurs in a maximum of three consecutive rows. When there are more missing data, NA remains.

tidyr::fill runs continuously. Adding a condition for a group doesn't seem to be justified either (I found such an example Fill NA with condition based on the same column).

In addition, the last value in the column can be NA.

Instead of fill, I also consider na.approx. na.approx would be better, but I don't know if it's possible because of the NA in the last line

Expected result

df1 <- read.table(text = "DT   odczyt.1
'2023-08-16 06:57:15'   482.307
'2023-08-16 06:57:30'   482.307
'2023-08-16 06:57:45'   482.311
'2023-08-16 06:58:00'   482.311
'2023-08-16 06:58:15'   482.311
'2023-08-16 06:58:30'   482.315
'2023-08-16 06:58:45'   482.315
'2023-08-16 06:59:00'   482.318
'2023-08-16 06:59:15'   482.324
'2023-08-16 06:59:30'   NA
'2023-08-16 06:59:45'   NA
'2023-08-16 07:00:00'   NA
'2023-08-16 07:00:15'   NA
'2023-08-16 07:00:30'   NA
'2023-08-16 07:00:45'   NA", header = TRUE)

Solution

  • An approach that does fill first and then subsequently conditionally filters with ifelse

    library(dplyr)
    library(tidyr)
    
    df1 %>% 
      mutate(new = odczyt.1) %>% 
      fill(new) %>% 
      mutate(grp = consecutive_id(odczyt.1)) %>% 
      mutate(odczyt.1 = ifelse(sum(is.na(odczyt.1)) <= 3 & is.na(odczyt.1), 
                          new, odczyt.1), .by = grp) %>% 
     select(-c(new, grp))
                        DT odczyt.1
    1  2023-08-16 06:57:15  482.307
    2  2023-08-16 06:57:30  482.307
    3  2023-08-16 06:57:45  482.311
    4  2023-08-16 06:58:00  482.311
    5  2023-08-16 06:58:15  482.311
    6  2023-08-16 06:58:30  482.315
    7  2023-08-16 06:58:45  482.315
    8  2023-08-16 06:59:00  482.318
    9  2023-08-16 06:59:15  482.324
    10 2023-08-16 06:59:30       NA
    11 2023-08-16 06:59:45       NA
    12 2023-08-16 07:00:00       NA
    13 2023-08-16 07:00:15       NA
    14 2023-08-16 07:00:30       NA
    15 2023-08-16 07:00:45       NA