Tags: r, string, max

Find most common word(s) in character string value


I have data that looks like

df <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8))

I want to find the most common word in each observation of variable A, where the words are separated by ", ".

All approaches I have found only extract the most common word in the entire column, such as

table(unlist(strsplit(df$A,", "))) %>% which.max() %>% names()

and I get

wrong_result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("b", "b", "b"))

If two or more words are tied for most frequent, all of them should be extracted. The result should look like

result <- data.frame(A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"), B = c(3, 5, 8), C = c("a", "b", "a, b"))

Solution

  • Split each row separately, tabulate its words, and keep every word whose count equals the row's maximum:

    library(dplyr)
    library(stringr)
    library(purrr)
    df %>%
      # map_chr() returns a character column; plain map() would give a list column
      mutate(maxi = map_chr(str_split(A, pattern = ", "),
                            ~ toString(names(which(table(.x) == max(table(.x)))))))
    
    #                    A B maxi
    #1 a, a, a, b, b, c, c 3    a
    #2 a, a, b, b, b, b, c 5    b
    #3          a, a, b, b 8 a, b
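
If you would rather avoid the tidyverse dependencies, the same per-row idea can be sketched in base R. This is only an illustration, not part of the answer above: `most_common` is a hypothetical helper name, and it assumes every value of A is a non-empty, non-missing string with words separated by exactly ", ".

```r
# Hypothetical helper (not from the original answer): returns the most
# frequent word(s) in one comma-separated string, ties joined by ", ".
most_common <- function(x, sep = ", ") {
  # Split this single string into words and count each word
  counts <- table(strsplit(x, sep, fixed = TRUE)[[1]])
  # Keep every word whose count equals the maximum, so ties are preserved
  toString(names(which(counts == max(counts))))
}

df <- data.frame(
  A = c("a, a, a, b, b, c, c", "a, a, b, b, b, b, c", "a, a, b, b"),
  B = c(3, 5, 8),
  stringsAsFactors = FALSE
)

# Apply the helper row by row; vapply() guarantees one string per row
df$C <- vapply(df$A, most_common, character(1), USE.NAMES = FALSE)
```

Because `table()` orders its names alphabetically, tied words come out in alphabetical order ("a, b"), which matches the expected `result` in the question.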