I cannot work this one out.
I have an incomplete dataset (many rows and variables) with one factor that specify whether all the other variables are pre- or post- something. I need to get summary statistics for all variables pre- and post- only including rows where the pre- AND post- values are not NA.
I am trying to find a way to replace existing values with NA if the set is incomplete separately for each variable.
The following is a simple example of what I am trying to achieve:
df = data.frame(
id = c(1,1,2,2),
myfactor = as.factor(c(1,2,1,2)),
var2change = c(10,10,NA,20),
var3change = c(5,10,15,20),
var4change = c(NA,2,3,8)
)
which leads to:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 2
3 2 1 NA 15 3
4 2 2 20 20 8
My desired output would be:
id myfactor var2change var3change var4change
1 1 1 10 5 NA
2 1 2 10 10 NA
3 2 1 NA 15 3
4 2 2 NA 20 8
I have much more than one variable to deal with and the set is incomplete in a different way for each variable independently. I have the feeling this may be achieved with smart use of existing functions from the plyr / tidyr packages but I cannot find an elegant way to apply the concepts to my problem.
Any help would be appreciated.
You can group by id
and if any value has NA
in it replace all of them with NA
. To apply a function to multiple columns we use across
.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('var'), ~if(any(is.na(.))) NA else .))
#for dplyr < 1.0.0 we can use `mutate_at`
#mutate_at(vars(starts_with('var')), ~if(any(is.na(.))) NA else .)
# id myfactor var2change var3change var4change
# <dbl> <fct> <dbl> <dbl> <dbl>
#1 1 1 10 5 NA
#2 1 2 10 10 NA
#3 2 1 NA 15 3
#4 2 2 NA 20 8