Search code examples
rtidyrplyr

Find rows with incomplete set depending on a factor, then replace values that exist by NA for the incomplete set


I cannot work this one out.

I have an incomplete dataset (many rows and variables) with one factor that specify whether all the other variables are pre- or post- something. I need to get summary statistics for all variables pre- and post- only including rows where the pre- AND post- values are not NA.

I am trying to find a way to replace existing values with NA if the set is incomplete separately for each variable.

The following is a simple example of what I am trying to achieve:

  df = data.frame(
    id = c(1,1,2,2),
    myfactor = as.factor(c(1,2,1,2)),
    var2change = c(10,10,NA,20),
    var3change = c(5,10,15,20),
    var4change = c(NA,2,3,8)
     )

which leads to:

  id myfactor var2change var3change var4change
1  1        1         10          5         NA
2  1        2         10         10          2
3  2        1         NA         15          3
4  2        2         20         20          8

My desired output would be:

  id myfactor var2change var3change var4change
1  1        1         10          5         NA
2  1        2         10         10         NA
3  2        1         NA         15          3
4  2        2         NA         20          8

I have much more than one variable to deal with and the set is incomplete in a different way for each variable independently. I have the feeling this may be achieved with smart use of existing functions from the plyr / tidyr packages but I cannot find an elegant way to apply the concepts to my problem.

Any help would be appreciated.


Solution

  • You can group by id and if any value has NA in it replace all of them with NA. To apply a function to multiple columns we use across.

    library(dplyr)
    
    df %>%
      group_by(id) %>%
      mutate(across(starts_with('var'), ~if(any(is.na(.))) NA else .))
      #for dplyr < 1.0.0 we can use `mutate_at`
      #mutate_at(vars(starts_with('var')), ~if(any(is.na(.))) NA else .)
    
    #     id myfactor var2change var3change var4change
    #  <dbl> <fct>         <dbl>      <dbl>      <dbl>
    #1     1 1                10          5         NA
    #2     1 2                10         10         NA
    #3     2 1                NA         15          3
    #4     2 2                NA         20          8