Search code examples
rloopssubsetmissing-data

Recoding missings in over a subset of columns


This is the first time that I am asking a question in the forum, so I apologize if I am not clear with my question. Also, just to clarify, I have multiple questions. I would like to substitute all the 99, 98, 97 values to missing (NA) values on a subset of columns in my data frame.

First I came up with the following code:

df1 <- df1 %>% replace_with_na_at(.vars = c("E3007_A","E3007_B",
                                            "E3007_C","E3007_D",
                                            "E3007_E","E3007_F",
                                            "E3007_G","E3007_H"),
  condition = ~ .x >94) 

It was working but it was taking 15 to 20 minutes to run this line of code, it was taking freaking FOREVER and driving me nuts. So instead, I decided to use the standard R code:

df1$E3007_A[df1$E3007_A > 94] <- NA
df1$E3007_B[df1$E3007_B > 94] <- NA
df1$E3007_C[df1$E3007_C > 94] <- NA
df1$E3007_D[df1$E3007_D > 94] <- NA
df1$E3007_E[df1$E3007_E > 94] <- NA
df1$E3007_F[df1$E3007_F > 94] <- NA
df1$E3007_G[df1$E3007_G > 94] <- NA
df1$E3007_H[df1$E3007_H > 94] <- NA

So, I have two questions: Why did the first code take forever while the second code was so much faster?

As for my second question:

I tried to run a loop instead of writing 8 lines of repetitive codes:

seq <- LETTERS[seq(1,8)]
for(i in seq){
  df1$E3007_[[i]][df1$E3007_[[i]] > 94] <- NA
}

I tried variations of this code, but the loop is just not working. What am I missing?


Solution

  • We can compare subset of multiple columns at once, then assign NA:

    cols <- c("E3007_A","E3007_B",
              "E3007_C","E3007_D",
              "E3007_E","E3007_F",
              "E3007_G","E3007_H")
    
    df1[, cols][ df1[, cols ] > 94 ] <- NA