r tidyverse recode

create a dummy variable (using mutate) based on a pattern in a character string


I'm trying to figure out how to create a dummy variable based on a pattern in a character string. The point is to end up with a simple way to make certain aspects of my ggplot (color, linetype, etc.) the same for samples that have something in common (such as different types of mutations of the same gene -- each sample name contains the name of the gene, plus some other characters).

As an example with the iris dataset, let's say I want to add a column (my dummy variable) that will have one value for species whose names contain the letter "v", and another value for species that don't. (In the real dataset, I have many more possible categories.)

I've been trying to use mutate and recode, str_detect, or if_else, but can't seem to get the syntax right. For instance,

mutate(iris, 
    anyV = ifelse(str_detect('Species', "v"), "withV", "noV"))

doesn't throw any errors, but it doesn't detect that any of the species names contain a v, either. I think this has to do with my inability to get str_detect to work:

iris %>% 
  select(Species) %>%
  str_detect("setosa")

just returns [1] FALSE.

iris %>% 
  filter(str_detect('Species', "setosa"))

doesn't work, either.

(I've also tried things like a mutate/recode solution, based on an example in "7 Most Practically Useful Operations When Wrangling Text Data in R", but can't get that to work, either.)

What am I doing wrong? And how do I fix it?


Solution

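  • This works. The column has to be passed as a bare name (Species); the quoted 'Species' is just a literal string, and that string contains no "v":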

    library(dplyr)    # mutate() and %>%
    library(stringr)  # str_detect()
    iris %>% mutate(
        anyV = ifelse(str_detect(Species, "v"), "withV", "noV"))
    
        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  anyV
    1            5.1         3.5          1.4         0.2     setosa   noV
    2            4.9         3.0          1.4         0.2     setosa   noV
    3            4.7         3.2          1.3         0.2     setosa   noV
    4            4.6         3.1          1.5         0.2     setosa   noV
    5            5.0         3.6          1.4         0.2     setosa   noV
    ...
    52           6.4         3.2          4.5         1.5 versicolor withV
    53           6.9         3.1          4.9         1.5 versicolor withV
    54           5.5         2.3          4.0         1.3 versicolor withV
    55           6.5         2.8          4.6         1.5 versicolor withV
    56           5.7         2.8          4.5         1.3 versicolor withV
    57           6.3         3.3          4.7         1.6 versicolor withV
    58           4.9         2.4          3.3         1.0 versicolor withV
    59           6.6         2.9          4.6         1.3 versicolor withV
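
To make the difference concrete, here is a small sketch (not part of the original answer) that compares the attempts from the question with their working counterparts. It assumes dplyr and stringr are installed; the variable names are only illustrative. In the question's second attempt, select() hands str_detect a whole data frame rather than a character vector, so pull() is used here instead to extract the column itself.

```r
library(dplyr)
library(stringr)

# A quoted name is a one-element character vector; "Species" contains no "v".
quoted <- str_detect("Species", "v")

# The actual column is checked element by element (one result per row).
by_row <- str_detect(iris$Species, "v")

# pull() returns the column as a vector, which str_detect can work with.
pulled <- iris %>% pull(Species) %>% str_detect("setosa")

# filter() works once the column name is unquoted.
setosa_rows <- iris %>% filter(str_detect(Species, "setosa"))
```

Inside mutate() and filter(), a bare column name refers to the column's values, which is why dropping the quotes is enough to fix the original code.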
    

    For more than two categories, case_when is an alternative to nested ifelse statements:

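    # Conditions are checked in order; the first one that matches wins.
    iris %>% mutate(newVar = case_when(
        str_detect(Species, "se") ~ "group1",
        str_detect(Species, "ve") ~ "group2",
        str_detect(Species, "vi") ~ "group3",
        # Fallback for anything that matched none of the patterns above.
        TRUE ~ as.character(Species)))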
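
For the real use case described in the question (sample names that contain a gene name plus other characters), writing one case_when line per gene can get long. One possible alternative, sketched here under assumptions, is to pull the gene name out directly with str_extract. The sample names, the gene list, and the "other" label below are all hypothetical and stand in for the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical sample names: a gene name plus mutation-specific characters.
samples <- tibble(
  sample = c("BRCA1_del5", "BRCA1_ins2", "TP53_R175H", "TP53_R248Q", "wildtype"),
  value  = c(1.2, 1.5, 0.7, 0.9, 1.0)
)

# Hypothetical list of genes, joined into one regex alternation: "BRCA1|TP53".
genes <- c("BRCA1", "TP53")
gene_pattern <- str_c(genes, collapse = "|")

samples <- samples %>%
  mutate(
    # str_extract() returns the first match, or NA when nothing matches;
    # coalesce() replaces those NAs with a catch-all label.
    gene = coalesce(str_extract(sample, gene_pattern), "other")
  )
```

The resulting gene column can then be mapped to a ggplot aesthetic, for example aes(color = gene), so that all samples of the same gene share a color or linetype.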