Search code examples
rexceltext-mining

Automated tagging/text mining in excel


I have a monthly excel spreadsheet with the following:

Category Description
A free text in paragraph form
B free text in paragraph form
C free text in paragraph form
B free text in paragraph form
B free text in paragraph form
A free text in paragraph form

I would like to add a third column that adds tags or keywords from a predetermined list that searches the free text and then pre-populates it based on whether one or more of the terms is found there or not.

So for example a list of tags could be price, distance, availability, location, and so on with the Keywords or Tags column populated based on the free text in the second column as below

Category Description Keywords or Tags
A Really doesn't like the price and location is too far price, location
B The distance is an issue and not too much availability Distance, availability
C Location is close so I like the convenience location, convenience
B The distance is near and there is a lot of availability availability, distance

As shown above, the tags would be separated by commas.

The issue is that the list of predetermined keywords is large (around 20 to 30 tags).

My Questions:

What would be the most efficient way to create this list without removing any tags?

Also, is there a way to do this in RStudio?


Solution

  • We can use regular expressions here to extract the keywords from the strings.

    If we put the keywords in a vector keywords, we can use the str_extract_all from the stringr package to extract all matching words in the string. I've made it into a simple function which we apply to the Description column of your data.frame, inserting the results into a new variable Keys

    library(stringr)
    
    get_tags <- function(str, tags) {
        res = str_extract_all(str,
                              regex(tags, ignore_case = T), # Search case insensitive
                              simplify = T)[,1] # Get result as vector, not matrix
        return(res[nchar(res) > 0])  # Drop empty strings from non-matched keywords
    }
    
    df$Keys <- sapply(df$Description,
                      function(x) paste0(get_tags(x, keywords),
                                         collapse=', ')) # Collapse matches w/ commas
    
    df
    
      Category                                             Description                   Keys
    1        A   Really doesn't like the price and location is too far        price, location
    2        B  The distance is an issue and not too much availability distance, availability
    3        C             Location is close so I like the convenience  Location, convenience
    4        D The distance is near and there is a lot of availability distance, availability
    
    

    Since you want the matches to be case insensitive, putting the regex pattern (tags) in the regex function allows us to specify that it should ignore case.