Replace NAs based on conditions in R


I have a dataset and I want to replace NAs with an empty string in those columns where the number of missing values is greater than or equal to n. For instance, n = 500.

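library(dplyr)  # tibble(), %>%, mutate(), across()
library(tidyr)  # replace_na()

set.seed(2022)

synthetic <- tibble(
  col1 = runif(1000),
  col2 = runif(1000),
  col3 = runif(1000)
)

# Pick 500 distinct rows and set col1 to NA in those rows
na_insert <- sample(nrow(synthetic), 500, replace = FALSE)

synthetic[na_insert, 1] <- NA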

Here is what I tried, which fails:

synthetic %>% 
  mutate(across(everything(), ~ replace_na(sum(is.na(.x)) >= 500, "")))

What am I doing wrong in this primitive exercise?


Solution

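  • In your attempt, replace_na() receives the single TRUE/FALSE result of sum(is.na(.x)) >= 500 rather than the column itself, so the NAs in the column are never touched. You could instead use where() with a purrr-style lambda to select only the columns that have at least 500 NAs, and convert those columns to character (a numeric column cannot hold "") before filling the gaps: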

    library(dplyr)
    
    synthetic %>% 
        mutate(across(where(~sum(is.na(.x)) >= 500), ~coalesce(as.character(.x), "")))
    

    This returns

    # A tibble: 1,000 x 3
       col1                  col2   col3
       <chr>                <dbl>  <dbl>
     1 ""                   0.479 0.139 
     2 "0.647259329678491"  0.410 0.770 
     3 ""                   0.696 0.805 
     4 ""                   0.863 0.803 
     5 "0.184729989385232"  0.146 0.652 
     6 "0.635790845612064"  0.634 0.0830
     7 ""                   0.305 0.527 
     8 "0.0419759317301214" 0.297 0.275 
     9 ""                   0.883 0.698 
    10 "0.757252902723849"  0.115 0.933 
    # ... with 990 more rows
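
If the threshold needs to vary, the same idea can be wrapped in a small function. This is only a sketch: the helper name blank_sparse_columns and its argument min_na are made up for illustration and are not part of dplyr or of the answer above. It rebuilds the example data from the question so that it runs on its own.

```r
library(dplyr)  # also re-exports tibble()

# Hypothetical helper (not from the original answer): replace NAs with ""
# only in columns that contain at least `min_na` missing values.
# Selected columns are converted to character, as in the solution above.
blank_sparse_columns <- function(data, min_na) {
  data %>%
    mutate(across(
      where(~ sum(is.na(.x)) >= min_na),
      ~ coalesce(as.character(.x), "")
    ))
}

# Same setup as in the question
set.seed(2022)
synthetic <- tibble(
  col1 = runif(1000),
  col2 = runif(1000),
  col3 = runif(1000)
)
synthetic$col1[sample(nrow(synthetic), 500)] <- NA

result <- blank_sparse_columns(synthetic, min_na = 500)

# Inspect NA counts before and after, and the resulting column types
colSums(is.na(synthetic))
colSums(is.na(result))
sapply(result, class)
```

Columns below the threshold keep their original type, so raising min_na above 500 here should leave col1 numeric with its NAs in place.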