
Create category based on keyword in R


I have a data frame with two columns: the first holds a keyword and the second holds the associated category.

keywords <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")

lookup_table <- data.frame(keywords, categories)

Each time I get a new label, I would like to check whether it contains one of the keywords and, if so, attach the corresponding category.

In the example below, the first row should get the value 'category1' in a new column:

new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

Help much appreciated!


Solution

  • Use str_extract to pull the matching keyword out of each label, then left-join the result to the reference table.

    library(tidyverse)

    keywords <- c("keyword1", "keyword2", "keyword3")
    categories <- c("category1", "category2", "category3")
    new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

    # Reference table mapping each keyword to its category.
    # A tibble is used instead of a data.table, following @AntoniosK's suggestion.
    ref_tbl <- tibble(keywords = keywords, categories = categories)

    # One row per label; the column is named `value` to match the output below.
    tibble(value = new_labels) %>%
        # Build the regular expression "keyword1|keyword2|keyword3" and
        # extract the first keyword found in each label (NA if none).
        mutate(ref_key = str_extract(value, str_flatten(keywords, "|"))) %>%
        # Attach the category of the matched keyword; unmatched rows stay NA.
        left_join(ref_tbl, by = c("ref_key" = "keywords"))
    #> # A tibble: 3 x 3
    #>   value             ref_key  categories
    #>   <chr>             <chr>    <chr>     
    #> 1 keyword1 qefjhqek keyword1 category1 
    #> 2 hfaef             <NA>     <NA>      
    #> 3 fihiz             <NA>     <NA>
    

    Created on 2018-11-10 by the reprex package (v0.2.1)
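
For readers who prefer not to load the tidyverse, the same extract-then-look-up idea can be sketched in base R. This is only an illustration, not part of the original answer: `attach_category` is a hypothetical helper name, and the sketch assumes the keywords contain no regular-expression metacharacters.

```r
keywords   <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

# Hypothetical helper (name chosen for this sketch only).
attach_category <- function(labels, keywords, categories) {
  # Combine the keywords into one alternation pattern:
  # "keyword1|keyword2|keyword3".
  # Assumption: keywords contain no regex metacharacters.
  pattern <- paste(keywords, collapse = "|")

  # Position of the first match in each label (-1 when nothing matches).
  hits <- regexpr(pattern, labels)

  # Extract the matched keyword; labels without a match stay NA.
  ref_key <- rep(NA_character_, length(labels))
  ref_key[hits != -1] <- regmatches(labels, hits)

  # Look up the category of each matched keyword;
  # match() yields NA for NA keys, so unmatched rows get NA.
  data.frame(
    value      = labels,
    ref_key    = ref_key,
    categories = categories[match(ref_key, keywords)],
    stringsAsFactors = FALSE
  )
}

result <- attach_category(new_labels, keywords, categories)
print(result)
```

Here `match()` plays the role of the left join: it returns the position of each extracted keyword in `keywords`, and indexing `categories` by that position picks the corresponding category.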


    Prompted by @AntoniosK's question, I compared data.table and tibble. In my timings the tibble version was the fastest, ahead of both data.table variants:

    1. tibble only: 2990 ms (1st)
    2. data.table and as.data.table: 3240 ms (2nd)
    3. data.table only: 3840 ms (3rd)
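
The answer does not say how these timings were measured, and the screenshots are not preserved. As a rough sketch only, assuming base R's `system.time` and a hypothetical helper `time_ms`, a candidate implementation could be timed as below. The `base_version` function is a base R stand-in, not one of the three variants listed above; the tibble and data.table pipelines would need to be wrapped in functions and added to the `candidates` list in the same way.

```r
keywords   <- c("keyword1", "keyword2", "keyword3")
categories <- c("category1", "category2", "category3")
new_labels <- c("keyword1 qefjhqek", "hfaef", "fihiz")

# Hypothetical helper: total elapsed time, in milliseconds,
# of `reps` consecutive calls to f().
time_ms <- function(f, reps = 1000) {
  elapsed <- system.time(for (i in seq_len(reps)) f())[["elapsed"]]
  elapsed * 1000
}

# Stand-in candidate written in base R (an assumption for this sketch;
# it is not one of the variants timed in the answer).
base_version <- function() {
  pattern <- paste(keywords, collapse = "|")
  hits <- regexpr(pattern, new_labels)
  ref_key <- rep(NA_character_, length(new_labels))
  ref_key[hits != -1] <- regmatches(new_labels, hits)
  categories[match(ref_key, keywords)]
}

# Named list of candidates; wrappers for other variants would go here.
candidates <- list(base = base_version)

# One elapsed-time figure (ms) per candidate.
timings <- vapply(candidates, time_ms, numeric(1))
print(timings)
```

Absolute numbers from such a harness depend on the machine and on the number of repetitions, so only the relative ordering of candidates run in the same session is meaningful.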