I have already looked on SO for an answer to this question, but didn't manage to find a solution to my problem.
I have a data frame with several columns, each of which has at least one NA. The names of these columns are stored in the character vector vars_na. For each of them, I would like to create a dummy variable that takes the value 0 if the value for that observation is missing, and 1 otherwise.
Below is a reproducible toy example and the code I have used so far:
library(dplyr)

# creation of toy dataset
iris[1:5, 1] <- rep(NA, 5)
iris[1:10, 4] <- rep(NA, 10)
vars_na <- c("Sepal.Length", "Petal.Width")

for (var in vars_na) {
  iris <- iris %>%
    mutate(dummy = ifelse(is.na(!!var), 0, 1)) %>%
    rename_at(c("dummy"), list(~paste0("dummyna_", var)))
  # 'rename_at' is just to differentiate between the several dummies created,
  # and it works correctly
}
The problem is that the newly created dummies are vectors full of 1s, so missing values are not flagged correctly:
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species dummyna_Sepal.Length dummyna_Petal.Width
1           NA         3.5          1.4          NA  setosa                    1                   1
2           NA         3.0          1.4          NA  setosa                    1                   1
3           NA         3.2          1.3          NA  setosa                    1                   1
4           NA         3.1          1.5          NA  setosa                    1                   1
5           NA         3.6          1.4          NA  setosa                    1                   1
6          5.4         3.9          1.7          NA  setosa                    1                   1
but I would like to obtain:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species dummyna_Sepal.Length dummyna_Petal.Width
1           NA         3.5          1.4          NA  setosa                    0                   0
2           NA         3.0          1.4          NA  setosa                    0                   0
3           NA         3.2          1.3          NA  setosa                    0                   0
4           NA         3.1          1.5          NA  setosa                    0                   0
5           NA         3.6          1.4          NA  setosa                    0                   0
6          5.4         3.9          1.7          NA  setosa                    1                   0
The code is simple, and I believed it would work. What am I doing wrong? Thanks in advance.
The problem is that, since var is a character string, is.na(!!var) ends up being evaluated as is.na("Sepal.Length"), which is always FALSE. As a result, ifelse() returns a single 1, which is recycled down the whole column.
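To see the difference concretely, here is a minimal base R sketch. The two-row data frame df is a made-up stand-in used only for illustration, not the data from the question.

```r
var <- "Sepal.Length"

# A string literal is never missing, so this is FALSE whatever the data holds.
is.na(var)

# Hypothetical two-row data frame with one missing value.
df <- data.frame(Sepal.Length = c(NA, 5.4))

# Looking the column up by its name checks the actual values instead.
is.na(df[[var]])
```

The first call tests the name itself, while the second tests the contents of the column that the name refers to.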
You can use rlang::sym* to transform character strings into symbols that mutate can evaluate, for example:
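for (var in vars_na) {
  # symbol referring to the existing column
  var_sym <- rlang::sym(var)
  # symbol for the new indicator column, e.g. Sepal.Length_na
  new_name <- rlang::sym(paste0(var, "_na"))
  iris <- iris %>%
    mutate(!!new_name := as.integer(!is.na(!!var_sym)))
}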
*The rlang package serves as the basis for most of the non-standard evaluation that dplyr supports; see tidy evaluation.
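For comparison, the same dummies can be built without tidy evaluation at all. The following is a minimal base R sketch rather than the approach above: it works on a copy named df (a name assumed here so that the built-in dataset is left untouched) and uses the dummyna_ prefix from the question.

```r
# Toy data mirroring the question, built from a fresh copy of iris.
df <- datasets::iris
df[1:5, "Sepal.Length"] <- NA
df[1:10, "Petal.Width"] <- NA
vars_na <- c("Sepal.Length", "Petal.Width")

for (var in vars_na) {
  # df[[var]] looks the column up by its name, so is.na() sees the values;
  # the result is 0 where the value is missing and 1 otherwise.
  df[[paste0("dummyna_", var)]] <- as.integer(!is.na(df[[var]]))
}

head(df)
```

Because `[[` accepts a character string directly, no conversion to a symbol is needed in this variant.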