Search code examples
rplyrmissing-data

R - Replace specific value contents with NA


I have a fairly large data frame that has multiple "-" which represent missing data. The data frame consisted of multiple Excel files, which could not use the "na.strings =" or alternative function, so I had to import them with the "-" representation.

How can I replace all "-" in the data frame with NA / missing values? The data frame consists of 200 columns of characters, factors, and integers.

So far I have tried:

sum(df %in c("-"))
returns: [1] 0

df[df=="-"] <-NA #does not do anything

library(plyr)
df <- revalue(df, c("-",NA))
returns: Error in revalue(tmp, c("-", NA)) : 
  x is not a factor or a character vector.

library(anchors)
df <- replace.value(df,colnames(df),"-",as.character(NA))
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

The data frame consists of 200 columns of characters, factors, and integers, so I can see why the last two do not work correctly. Any help would be appreciated.


Solution

  • Since you're already using tidyverse functions, you can easily use na_if from dplyr within your pipes.

    For example, I have a dataset where 999 is used to fill in a non-answer:

    df <- tibble(
        alpha = c("a", "b", "c", "d", "e"), 
        val1 = c(1, 999, 3, 8, 999), 
        val2 = c(2, 8, 999, 1, 2))
    

    If I wanted to change val1 so 999 is NA, I could do:

    df %>% 
        mutate(val1 = na_if(val1, 999))
    

    In your case, it sounds like you want to replace a value across multiple variables, so using across for multiple columns would be more appropriate:

    df %>%
        mutate(across(c(val1, val2), na_if, 999)) # or val1:val2
    

    replaces all instances of 999 in both val1 and val2 with NA and now looks like this:

    # A tibble: 5 x 3
      alpha  val1  val2
      <chr> <dbl> <dbl>
    1 a        1.    2.
    2 b       NA     8.
    3 c        3.   NA 
    4 d        8.    1.
    5 e       NA     2.