Search code examples
rtmword-frequency

R Text Mining - the most frequent word in string across entire data frame


I am struggling to grasp text mining and determine word frequencies. I am just starting to have an understanding of R and its packages and I just find out about tm (after reading a while I have a feeling that this might solve my problem).

My question is: how can I determine the two most frequently used in a string across the entire column?

I have the following example:

    structure(list(Location = c("Chicago", "Chicago", "Chicago", 
"LA", "LA", "LA", "LA", "LA", "LA", "Texas", "Texas", "Texas", 
"Texas", "Texas"), Code = c(4450L, 4450L, 4450L, 4450L, 4450L, 
4450L, 4450L, 4450L, 4450L, 4410L, 4410L, 4410L, 4410L, 4410L
), Description = c("LABOR - CROSSOVER BOARD BRACKET", "LABOR - CROWN DOOR GASKET", 
"LABOR - CROWN DOOR GASKET - APPLY 4' NEW GASKET AND RE-APPLY", 
"LABOR - CUSHIONING DEVICE - END OF CAR CUSTOMER SUPPLIED MATERIAL", 
"LABOR - DOOR EDGE", "LABOR - DOOR GASKET, CROWN CORNER", "LABOR - DOOR LOCK POCKET STG", 
"LABOR - DOOR LOCK RECEPTICALS STG", "LABOR - DOOR LOCK STG", 
"BOLT, HT, UNDER 5/8\"\" DIA & 6\"\" - SIDE POST", "BOLT, HT, UNDER 5/8\"\" DIA & 6\"\" - TRAINLINE TROLLEY", 
"BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - BRAKE STEP", "BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - CROSSOVER BOARD", 
"BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - CROSSOVER BOARD BRACKET"
), `Desired Description Based on frequency` = c("Labor - CROWN DOOR GASKET", 
"Labor - CROWN DOOR GASKET", "Labor - CROWN DOOR GASKET", "Labor - DOOR LOCK", 
"Labor - DOOR LOCK", "Labor - DOOR LOCK", "Labor - DOOR LOCK", 
"Labor - DOOR LOCK", "Labor - DOOR LOCK", "Bolt - HT", "Bolt - HT", 
"Bolt - HT", "Bolt - HT", "Bolt - HT")), .Names = c("Location", 
"Code", "Description", "Desired Description Based on frequency"
), row.names = c(NA, -14L), class = "data.frame")

In the end I wish I could add this column:

Desired Description Based on frequency
Labor - CROWN DOOR GASKET
Labor - CROWN DOOR GASKET
Labor - CROWN DOOR GASKET
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Bolt - HT
Bolt - HT
Bolt - HT
Bolt - HT
Bolt - HT

Basically I want to evaluate all the 4450 or 4410s and see out of all the description in the table, which the most common and add that as a column. Another condition would be based on the location. Can someone please help me with a simple example?

Thank you so much


Solution

  • I don't think there's a one-size-fits-it-all-solution to your problem. (Beginning with the fact that there's no exact rule on which or how many words to take for the description.) However, here are two quick&dirty approaches, which might be helpful as a starting point:

    library(tm)
    txts <- gsub("[^A-Z]", " ", df$Description)
    groups <- paste(df$Location, df$Code)
    
    # 1
    opts <- list(tolower=F, removePunctuation=TRUE, wordLengths=c(2, Inf))
    lst <- split(txts, groups)
    res <- sapply(lst, function(x) { 
      freq <- termFreq(paste(x, collapse=" "), opts)/length(x)
      paste(names(freq[rank(-freq, ties.method = "first")<=3]), collapse = " - ")
    })
    rep(res, lengths(lst))
    
    # 2 
    lst <- lapply(strsplit(txts, "\\s+"), function(x) x[1:min(c(3,length(x)))] )
    lst <- split(lst, groups)
    n <- lengths(lst)
    lst <- mapply("/", lapply(lst, function(x) sort(table(unlist(x)), decreasing = T)), n)
    rep(sapply(lst, function(x) paste(names(x)[x>=.5], collapse=" - ")), n)