Search code examples
rtokenize

Keyword Search from separate dataframe


I have some R code below with examples of the 2 dataframes that I'm working with. The "keywords" df is periodically changing, so I need to create some code that will FLAG the rows within "mydata" where the Segment matches and the mydata$Acct_Name only needs to contain the word from keywords$KEYWORD somewhere within the cell.

I started doing a FOR loop, but things get tricky quickly when you're dealing with grepl and multiple dataframes. My next thought was to parse mydata$Acct_Name to then attempt a merge between the 2 dfs.

Any help is much appreciated!

dummy data

Acct_Name <- c('joes ski shop'
               ,'joes alarm shop'
               ,'joes alarm spot'
               ,'joes bakery'
               ,'joes albergue shop'
               ,'jims Brewery'
               ,'jims albergue place'
               )
Segment <- c('All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
             )

mydata <- data.frame(Acct_Name, Segment)

mydata$Acct_Name <- as.character(mydata$Acct_Name)
mydata$Segment <- as.character(mydata$Segment)


Segment <- c('All_Other'
             ,'All_Other'
             ,'All_Other'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
             ,'Apartments'
)
KEYWORD <- c('aislamiento'
             ,'alarm'
             ,'alarma'
             ,'albergue'
             ,'alcantarilla cloaca'
             ,'alcohol'
             ,'almacenamiento'
)

keywords <- data.frame(Segment,KEYWORD)
keywords$FLAG <- 1
keywords$Segment <- as.character(keywords$Segment)
keywords$KEYWORD <- as.character(keywords$KEYWORD)

Solution

  • You want to find any of the keywords for the group within the entries in mydata for that group. We can basically just collapse each group to one or condition, using paste, with the collapse = "|" specified. Then do a merge and make a new column of results using grepl. Using data.table:

    library(data.table)
    # make the conditions, collapsing by group
    kwords <- as.data.table(keywords)[, KWORD := paste(KEYWORD, collapse = "|"), by = Segment
      ][, .SD[1], by = Segment, .SDcols = c("KWORD")]
    
    # make a column based on the grepl with condition
    mydata <- as.data.table(mydata)
    kwords[mydata, on = "Segment"][, flag := grepl(KWORD, Acct_Name), by = Acct_Name][]
    
    # output:
    # Segment                                               KWORD           Acct_Name  flag
    # 1:  All_Other                            aislamiento|alarm|alarma       joes ski shop FALSE
    # 2:  All_Other                            aislamiento|alarm|alarma     joes alarm shop  TRUE
    # 3:  All_Other                            aislamiento|alarm|alarma     joes alarm spot  TRUE
    # 4:  All_Other                            aislamiento|alarm|alarma         joes bakery FALSE
    # 5: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento  joes albergue shop  TRUE
    # 6: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento        jims Brewery FALSE
    # 7: Apartments albergue|alcantarilla cloaca|alcohol|almacenamiento jims albergue place  TRUE
    

    EDIT: Another option, that might work when there are many keywords per group, is to use stringr::str_detect, which is vectorized over pattern. Like so:

    as.data.table(mydata)[, flag := any(
      stringr::str_detect(Acct_Name, 
                          keywords[keywords$Segment == Segment,"KEYWORD"])), 
      by = Acct_Name][]
    
    # Acct_Name    Segment  flag
    # 1:       joes ski shop  All_Other FALSE
    # 2:     joes alarm shop  All_Other  TRUE
    # 3:     joes alarm spot  All_Other  TRUE
    # 4:         joes bakery  All_Other FALSE
    # 5:  joes albergue shop Apartments  TRUE
    # 6:        jims Brewery Apartments FALSE
    # 7: jims albergue place Apartments  TRUE
    

    We want to see if, for the subset of keywords such that keywords$Segment == mydata$Segment, any of the patterns from keywords$KEYWORD can be found with str_detect in mydata$Acct_Name. This solution seems a little funky to me, because it mixes a few different methods of referencing columns and mixes data.frame and data.table, but it seems to work. Perhaps this will work with the size of your original data.

    Or, instead of having to repeatedly subset, make a list beforehand and use that (using split to split the data.frame, the equivalent of subsetting here, and lapply to get only the KEYWORD column):

    keywords.list <- lapply(split(keywords, keywords$Segment), function(x) x$KEYWORD)
    

    Then refer to each conveniently named list item by the value of Segment in mydata:

    as.data.table(mydata)[, flag := any(
      stringr::str_detect(
        Acct_Name, 
        keywords.list[[Segment]])), by = Acct_Name][]
    
    # Acct_Name    Segment  flag
    # 1:       joes ski shop  All_Other FALSE
    # 2:     joes alarm shop  All_Other  TRUE
    # 3:     joes alarm spot  All_Other  TRUE
    # 4:         joes bakery  All_Other FALSE
    # 5:  joes albergue shop Apartments  TRUE
    # 6:        jims Brewery Apartments FALSE
    # 7: jims albergue place Apartments  TRUE