Search code examples
rtidyversestringrtidytext

What would be the best approach for grouping data according to a table of keywords in R


I have the following dictionary for grouping data

 1. [aa11, aa21, aa31, aa34],      "group A"
 2. [x23z, x22z, x32z, x35z, x34z],"group B"
 3. [lg32z, lg22z, lg84x, lg94y],  "group C"
 4. ...

The column in the data itself may also have more than one code. What I want is use the dictionary above and assign the groups to the data:

 1. [aa31, aa34], "group A"
 2. [lg94z],      "group C"
 3. [lg84x],      "group C"
 4. [x22z, x23z]  "group B"

What would be the best way to assign the records to the specific group described in the first table.

My thoughts so far: Should the data be first reshaped (tokenized) to have one code per record in both the "dictionary" table and the processed table?


Solution

  • There are two scenarios:

    1. Your code variable might contain several codes, but all from the same group.
    2. Your code variable might contain several codes, but they could be from different groups.

    Lookup input for both solutions

    library(tidyverse)
    library(tidytext)
    
    lookup <-  data.frame(code  = c("[aa11, aa21, aa31, aa34]",
                                    "[x23z, x22z, x32z, x35z, x34z]",
                                    "[lg32z, lg22z, lg84x, lg94y]"),
                          group = c("group A", "group B", "group C"))
    
    lookup_long <- lookup %>%
      unnest_tokens(words, code, "words")
    

    Solution for 1.

    You had a good idea to first tokenize the lookup and the input data and then do the match based on the word tokens. In case you have several codes in your code variable, I assume you just want to keep the information once, i.e. have just ONE column with the group information.

    df_1 <- data.frame(code = c("[aa31, aa34]", "[lg94y]", "[lg84x]", "[x22z, x23z]"),
                       id   = 1:4)
    
    df_1 %>%
      unnest_tokens(code_new, code, "words", drop = FALSE) %>%
      left_join(lookup_long, by = c("code_new" = "words")) %>%
      group_by(id) %>%
      distinct(group, .keep_all = TRUE) %>%
      ungroup() %>%
      select(-id, -code_new)
    

    which gives:

    # A tibble: 4 x 2
      code         group  
      <chr>        <chr>  
    1 [aa31, aa34] group A
    2 [lg94y]      group C
    3 [lg84x]      group C
    4 [x22z, x23z] group B
    

    Solution for 2.

    It's generally the same idea, but it needs a bit more reshaping to store the group information for all codes.

    df_2 <- data.frame(code = c("[aa31, aa34]", "[lg94y]", "[lg84x]", "[x22z, x23z]", "[x22z, aa11]"),
                       id   = 1:5)
    
    df_2 %>%
      unnest_tokens(code_new, code, "words", drop = FALSE) %>%
      left_join(lookup_long, by = c("code_new" = "words")) %>%
      group_by(id) %>%
      distinct(group, .keep_all = TRUE) %>%
      mutate(id_wide = 1:n()) %>%
      ungroup() %>%
      pivot_wider(values_from  = group,
                  names_from   = id_wide,
                  names_prefix = "group_",
                  id_cols      = code)
    

    which gives:

    # A tibble: 5 x 3
      code         group_1 group_2
      <chr>        <chr>   <chr>  
    1 [aa31, aa34] group A <NA>   
    2 [lg94y]      group C <NA>   
    3 [lg84x]      group C <NA>   
    4 [x22z, x23z] group B <NA>   
    5 [x22z, aa11] group B group A