
Group by using str_detect for groups with similar strings


Consider this example data:

library(tidyverse)

dt <- tibble(Poison = c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar'),
             Result = c('Death', 'Death With Class', 'Death', 'Death'))

I want to create a column that gives each group an identification number. However, I want the poisons to be grouped by string detection, i.e., 'Arsenic' and 'Arsenic in Wine' should form one group and 'Cyanide' and 'Cyanide and Sugar' another. Currently, R treats each distinct string as its own group, like so:

dt <- dt %>%
  group_by(Poison) %>%
  mutate(Group = n())
# A tibble: 4 × 3
# Groups:   Poison [4]
  Poison            Result           Group
  <chr>             <chr>            <int>
1 Arsenic           Death                1
2 Arsenic in Wine   Death With Class     1
3 Cyanide           Death                1
4 Cyanide and Sugar Death                1

I want 'Arsenic' and 'Arsenic in Wine' to be Group 1, and 'Cyanide' and 'Cyanide and Sugar' to be Group 2. Any ideas?


Solution

  • A combination of case_when and grepl could be useful:

    dt %>% 
      mutate(Group = case_when(
        grepl("Arsenic", Poison) ~ 1,
        grepl("Cyanide", Poison) ~ 2
      ))
    # A tibble: 4 × 3
      Poison            Result           Group
      <chr>             <chr>            <dbl>
    1 Arsenic           Death                1
    2 Arsenic in Wine   Death With Class     1
    3 Cyanide           Death                2
    4 Cyanide and Sugar Death                2
    

    If you don't want to write out each poison by hand, this could be useful (it groups rows by the first word of Poison):

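    dt %>% 
      mutate(Group = sub(" .*", "", Poison) %>%  # drop everything after the first space
               as.factor() %>%                   # one factor level per first word
               as.integer())                     # the level index becomes the group number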
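
Since the question asks about string detection and group_by specifically, here is a minimal sketch of two further variants. It assumes dplyr 1.0.0 or later (for cur_group_id()) and stringr. The helper column Base and the keywords vector are hypothetical names introduced only for illustration. The first variant assumes, like the sub() approach above, that the first word identifies the poison. The second assumes you can list the keywords to detect.

```r
library(dplyr)
library(stringr)

dt <- tibble(
  Poison = c("Arsenic", "Arsenic in Wine", "Cyanide", "Cyanide and Sugar"),
  Result = c("Death", "Death With Class", "Death", "Death")
)

# Variant 1: group by the first word, then number the groups.
# `Base` is a hypothetical helper column, not part of the original data.
grouped <- dt %>%
  mutate(Base = str_extract(Poison, "^\\S+")) %>%  # first run of non-space characters
  group_by(Base) %>%
  mutate(Group = cur_group_id()) %>%               # integer ID of the current group
  ungroup()

# Variant 2: detect known keywords anywhere in the string.
# `keywords` is an assumed, hand-written list; the group number is the
# position of the first matching keyword in that list.
keywords <- c("Arsenic", "Cyanide")
by_keyword <- dt %>%
  mutate(Group = match(str_extract(Poison, str_c(keywords, collapse = "|")),
                       keywords))

print(grouped)
print(by_keyword)
```

Variant 1 needs no list of poisons, but it would split names whose shared part is not the first word. Variant 2 matches the keyword anywhere in the string and leaves Group as NA for rows that contain none of the keywords.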