I'm trying to figure out how to create a dummy variable based on a pattern in a character string. The point is to end up with a simple way to make certain aspects of my ggplot (color, linetype, etc.) the same for samples that have something in common (such as different types of mutations of the same gene -- each sample name contains the name of the gene, plus some other characters).
As an example with the iris dataset, let's say I want to add a column (my dummy variable) that will have one value for species whose names contain the letter "v", and another value for species that don't. (In the real dataset, I have many more possible categories.)
I've been trying to use mutate and recode, str_detect, or if_else, but can't seem to get the syntax right. For instance,
mutate(iris,
anyV = ifelse(str_detect('Species', "v"), "withV", "noV"))
doesn't throw any errors, but it doesn't detect that any of the species names contain a v, either. I think this comes down to my not understanding how to get str_detect to work:
iris %>%
select(Species) %>%
str_detect("setosa")
just returns [1] FALSE.
iris %>%
filter(str_detect('Species', "setosa"))
doesn't work, either.
(I've also tried things like a mutate/recode solution, based on an example in "7 Most Practically Useful Operations When Wrangling Text Data in R", but can't get that to work, either.)
What am I doing wrong? And how do I fix it?
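This works. In the question, 'Species' is quoted, so str_detect searches the literal string "Species" (which contains no "v") rather than the column. Pass the bare column name instead: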
library(dplyr)
library(stringr)
# Species is unquoted, so str_detect() sees the column values
iris %>% mutate(
  anyV = ifelse(str_detect(Species, "v"), "withV", "noV"))
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  anyV
1           5.1         3.5          1.4         0.2     setosa   noV
2           4.9         3.0          1.4         0.2     setosa   noV
3           4.7         3.2          1.3         0.2     setosa   noV
4           4.6         3.1          1.5         0.2     setosa   noV
5           5.0         3.6          1.4         0.2     setosa   noV
...
52          6.4         3.2          4.5         1.5 versicolor withV
53          6.9         3.1          4.9         1.5 versicolor withV
54          5.5         2.3          4.0         1.3 versicolor withV
55          6.5         2.8          4.6         1.5 versicolor withV
56          5.7         2.8          4.5         1.3 versicolor withV
57          6.3         3.3          4.7         1.6 versicolor withV
58          4.9         2.4          3.3         1.0 versicolor withV
59          6.6         2.9          4.6         1.3 versicolor withV
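The same fix should apply to the other two attempts in the question. Here is a minimal sketch, assuming dplyr and stringr are installed; the names setosa_rows and is_setosa are made up for illustration. filter() needs the bare column name, and str_detect() needs a vector, which pull() supplies (select() returns a data frame instead).

```r
library(dplyr)
library(stringr)

# filter() treats Species as a column only when it is unquoted
setosa_rows <- iris %>%
  filter(str_detect(Species, "setosa"))

# pull() extracts the column as a vector; select() would keep a data frame
is_setosa <- iris %>%
  pull(Species) %>%
  str_detect("setosa")

nrow(setosa_rows)  # number of rows whose species name matches
sum(is_setosa)     # number of TRUE values in the logical vector
```

Both counts should agree, since they test the same pattern against the same column.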
An alternative to nested ifelse statements:
iris %>% mutate(newVar = case_when(
  str_detect(Species, "se") ~ "group1",
  str_detect(Species, "ve") ~ "group2",
  str_detect(Species, "vi") ~ "group3",
  TRUE ~ as.character(Species)))
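Since the goal in the question is to share ggplot aesthetics across related samples, here is a rough sketch of how the new column might be used, assuming ggplot2 is installed. The names iris_grouped, group_counts, and p are hypothetical, and the choice of axes is arbitrary. Note that case_when uses the first condition that matches, so the order of the patterns matters when a name could match more than one.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Build the grouping column once and keep the result
iris_grouped <- iris %>%
  mutate(newVar = case_when(
    str_detect(Species, "se") ~ "group1",
    str_detect(Species, "ve") ~ "group2",
    str_detect(Species, "vi") ~ "group3",
    TRUE ~ as.character(Species)))

# Check how many rows landed in each group before plotting
group_counts <- count(iris_grouped, newVar)

# Map the new column to colour so samples in the same group look alike
p <- ggplot(iris_grouped,
            aes(Sepal.Length, Sepal.Width, colour = newVar)) +
  geom_point()
```

Printing p draws the plot; the same column could be mapped to linetype or shape in the same way.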