How to filter out data with conditional statement for series of numbers in R?


Data

Here is the data for my example:

#### Create Data ####
library(dplyr)  # provides %>% and as_tibble()

df <- data.frame(X1 = c(NA,1,1,1,0),
                 X2 = c(1,1,1,0,0),
                 X3 = c(1,1,NA,0,0),
                 X4 = c(1,1,1,1,NA),
                 X5 = c(1,1,1,0,NA),
                 X6 = c(1,NA,1,1,NA)) %>%
  as_tibble()

Problem

When you print the data, it looks like this:

# A tibble: 5 × 6
     X1    X2    X3    X4    X5    X6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA     1     1     1     1     1
2     1     1     1     1     1    NA
3     1     1    NA     1     1     1
4     1     0     0     1     0     1
5     0     0     0    NA    NA    NA

Some of the missingness in this data is sporadic and random (rows 1-3). In row 5, however, the NA values come from a stopping rule: after three "wrong" answers (zeroes) in a row, the remaining items were converted to NA. In principle I could blindly replace every NA with 0 using the following code:

df %>% 
  mutate(across(everything(),
                ~ replace(.,
                          is.na(.),
                          0)))

And every NA would be replaced with 0:

# A tibble: 5 × 6
     X1    X2    X3    X4    X5    X6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0     1     1     1     1     1
2     1     1     1     1     1     0
3     1     1     0     1     1     1
4     1     0     0     1     0     1
5     0     0     0     0     0     0

However, this does not address the problem faithfully. The sporadic NAs are genuinely missing, whereas the NAs created by the stopping rule are not: they stand in for wrong answers. So I need a way to replace NA with 0 only in cases where three 0s are recorded in a row, but I'm struggling to figure out how to do this.


Solution

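  • Using is.na, we can paste0 each row's NA indicators into a string and use stringi::stri_count to check whether the pattern 111 (three NAs in a row, which is how the stopping rule shows up in this data) occurs at least once. That gives a row-wise flag. After that, replace the NAs with zeros only in the flagged rows.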

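    num_NA <- 3  # length of the NA run that marks a stopped row
    # Encode each row as a 0/1 string (1 = NA), then flag rows containing num_NA ones in a row
    flag <- apply(+(is.na(df)), 1, paste0, collapse = '') |>
      stringi::stri_count(regex = paste(rep(1, num_NA), collapse = '')) |> base::`>`(0)

    # Replace NA with 0 in the flagged rows only
    df[flag, ] <- lapply(df[flag, ], \(x) replace(x, is.na(x), 0))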
    df
    #   X1 X2 X3 X4 X5 X6
    # 1 NA  1  1  1  1  1
    # 2  1  1  1  1  1 NA
    # 3  1  1 NA  1  1  1
    # 4  1  0  0  1  0  1
    # 5  0  0  0  0  0  0
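
To see what the flag step is doing, it can help to look at the intermediate values. The following is a minimal sketch in base R only; `na_pattern` is just an illustrative name, and `grepl()` is swapped in here for `stringi::stri_count()` as a base-R way of asking the same yes/no question.

```r
df <- data.frame(X1 = c(NA, 1, 1, 1, 0),
                 X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0),
                 X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA),
                 X6 = c(1, NA, 1, 1, NA))

# One string per row: "1" marks an NA cell, "0" marks a recorded value
na_pattern <- apply(+(is.na(df)), 1, paste0, collapse = "")
na_pattern
# "100000" "000001" "001000" "000000" "000111"

# TRUE where a row contains three NA cells in a row
flag <- grepl("111", na_pattern, fixed = TRUE)
flag
# FALSE FALSE FALSE FALSE TRUE
```

Only row 5 is flagged, so only its NAs are replaced; the single NAs in rows 1-3 never form a run of three.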
    

    Data:

    df <- structure(list(X1 = c(NA, 1, 1, 1, 0), X2 = c(1, 1, 1, 0, 0), 
        X3 = c(1, 1, NA, 0, 0), X4 = c(1, 1, 1, 1, NA), X5 = c(1, 
        1, 1, 0, NA), X6 = c(1, NA, 1, 1, NA)), class = "data.frame", row.names = c(NA, 
    -5L))
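
The flag above keys on three NAs in a row. If the rule should follow the question's wording more literally (three 0s in a row, then NAs to the end of the row), one possible variant uses `rle()` on each row. This is only a sketch under that reading of the stopping rule; `fill_after_stop` and its `n_zero` argument are hypothetical names, not part of any package.

```r
df <- data.frame(X1 = c(NA, 1, 1, 1, 0),
                 X2 = c(1, 1, 1, 0, 0),
                 X3 = c(1, 1, NA, 0, 0),
                 X4 = c(1, 1, 1, 1, NA),
                 X5 = c(1, 1, 1, 0, NA),
                 X6 = c(1, NA, 1, 1, NA))

# Hypothetical helper: x is one row as a numeric vector
fill_after_stop <- function(x, n_zero = 3) {
  # Run-length encode the row, treating NA as its own label
  runs <- rle(ifelse(is.na(x), "NA", as.character(x)))
  k <- length(runs$values)
  # Assumed rule: the row ends in a run of NAs that directly
  # follows a run of at least n_zero zeros
  stopped <- k >= 2 &&
    runs$values[k] == "NA" &&
    runs$values[k - 1] == "0" &&
    runs$lengths[k - 1] >= n_zero
  if (stopped) {
    n_trailing <- runs$lengths[k]
    x[(length(x) - n_trailing + 1):length(x)] <- 0
  }
  x
}

# Apply row-wise; apply() returns rows as columns, so transpose back
df[] <- t(apply(as.matrix(df), 1, fill_after_stop))
df
# Row 5 becomes all zeros; the NAs in rows 1-3 are left as NA
```

On this data both approaches give the same result. They would differ for a row with three sporadic NAs in a row that is not preceded by three 0s: the NA-run flag would fill it, while this variant would leave it alone.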