Search code examples
rfrequency

Display a table of strings and their variations per row (R)


For a large database, I would like to find a solution where I could predefine the strings to be searched and then get a table that would contain the frequency of these strings and their possible variations per row.

strings <- c("dog", "cat", "mouse")

var1 <- c("black dog", "white dog", "angry dog", "dogs and cats are nice", "dog")
var2 <- c("white cat", "black cat", "tiny cat", NA, "cow")
var3 <- c("little mouse", "big mouse", NA, NA, "mouse")
data <- data.frame(var1, var2, var3)

The result should look like this while I am looking for dog, cat and mouse:

dog&cat 4
mouse 3

Solution

  • We may try

    v1 <- do.call(paste, data)
    stack(setNames(lapply(c( "\\bdog.*\\bcat|\\bcat.*\\bdog", "mouse"), 
        \(pat) sum(grepl(pat, v1))), c("dog&cat", "mouse")))[2:1]
          ind values
    1 dog&cat      4
    2   mouse      3
    

    Or if we need all the combinations

    lst1 <- lapply(c(strings, combn(strings, 2, FUN = \(x) 
       sprintf("\\b%1$s.*\\b%2$s|\\b%2$s.*\\b%1$s", x[1], x[2]))), 
        \(pat) sum(grepl(pat, v1)))
    names(lst1) <- c(strings, combn(strings, 2, FUN = paste, collapse = "&"))
    stack(lst1)[2:1]
            ind values
    1       dog      5
    2       cat      4
    3     mouse      3
    4   dog&cat      4
    5 dog&mouse      3
    6 cat&mouse      2
    

    For more combinations, it may be better to use Reduce with individually applying grepl

    lst1 <- lapply(1:3, \(n) {
       vals <- colSums(combn(strings, n, 
      FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
       nms <- combn(strings, n, FUN = paste, collapse = "&")
       setNames(vals, nms)
       })
    stack(unlist(lst1))[2:1]
                ind values
    1           dog      5
    2           cat      4
    3         mouse      3
    4       dog&cat      4
    5     dog&mouse      3
    6     cat&mouse      2
    7 dog&cat&mouse      2
    

    Or with tidyverse

    library(dplyr)
    library(stringr)
    library(tidyr)
    data %>% 
      unite(var, everything(), na.rm = TRUE, sep = " ") %>% 
      summarise(`dog&cat` = sum(str_detect(var,
       "\\bdog.*\\bcat|\\bcat.*\\bdog")),
         mouse = sum(str_detect(var, 'mouse'))) %>%   
      pivot_longer(everything())
    
    

    -output

     # A tibble: 2 × 2
      name    value
      <chr>   <int>
    1 dog&cat     4
    2 mouse       3