Tags: r, dplyr, tidytext

Count co-occurrences of two words in R, ignoring the order of the words


WHAT I WANT: I want to count the co-occurrences of two words, but I don't care about the order in which they appear in the string.

MY PROBLEM: I don't know how to handle the case where the two words appear in a different order.

SO FAR: I use the unnest_tokens function to split the string into words, with the "skip_ngrams" option for the token argument. Then I keep only the combinations of exactly two words, use separate to create the word1 and word2 columns, and finally count the occurrences.

The output that I get is like this:

# A tibble: 3 × 3
  word1 word2      n
  <chr> <chr>  <dbl>
1 a     c          3
2 b     a          1
3 c     a          5

But the words "a" and "c" occur in both orders, so they are counted as two different pairs. What I want is this:

# A tibble: 2 × 3
  word1 word2      n
  <chr> <chr>  <dbl>
1 a     c          8
2 b     a          1

MY DATA: Here is the whole process on different data that shows the same problem. In this case the pairs "a b" / "b a" and "a c" / "c a" should each be counted as a single pair with n = 2.

library(tidyverse)
library(tidytext)
enframe(c("a b c a d e")) %>% 
  unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>% 
  mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
  filter(n_words == 2) %>% 
  separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>% 
  count(word1, word2) 
#> # A tibble: 9 × 3
#>   word1 word2     n
#>   <chr> <chr> <int>
#> 1 a     b         1
#> 2 a     c         1
#> 3 a     d         1
#> 4 a     e         1
#> 5 b     a         1
#> 6 b     c         1
#> 7 c     a         1
#> 8 c     d         1
#> 9 d     e         1

Created on 2022-02-09 by the reprex package (v2.0.1)


Solution

  • We can use pmin/pmax to sort the two words within each row before applying count, so that both orderings of a pair end up in the same row

    library(tibble)    # enframe()
    library(tidytext)  # unnest_tokens()
    library(dplyr)
    library(stringr)
    library(tidyr)
    enframe(c("a b c a d e")) %>% 
      unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>% 
      # keep only the skip-grams made of exactly two words
      mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
      filter(n_words == 2) %>% 
      separate(col = skipgram, into = c("word1", "word2"), 
          sep = "\\s+") %>%
      # put the alphabetically smaller word first, so "c a" becomes "a c"
      transmute(word11 = pmin(word1, word2), word22 = pmax(word1, word2)) %>%
      count(word11, word22)
    

    Output:

    # A tibble: 7 × 3
      word11 word22     n
      <chr>  <chr>  <int>
    1 a      b          2
    2 a      c          2
    3 a      d          1
    4 a      e          1
    5 b      c          1
    6 c      d          1
    7 d      e          1
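
The same idea can be applied to pairs that have already been counted, as in the first table of the question. The following is only a minimal sketch: the `pairs` tibble is typed in by hand from that table, and the column names `first` and `second` are made up for illustration. It assumes that passing `wt = n` to `count()` is acceptable for summing the existing counts. Note that pmin/pmax compare strings using the session's collation order, and that the pair "b a" comes out as "a b" rather than in its original order.

```r
library(dplyr)
library(tibble)

# Hypothetical input: the pre-counted pairs from the first table in the question
pairs <- tribble(
  ~word1, ~word2, ~n,
  "a",    "c",    3,
  "b",    "a",    1,
  "c",    "a",    5
)

merged <- pairs %>%
  # sort the two words within each row so that both orderings look the same
  mutate(
    first  = pmin(word1, word2),
    second = pmax(word1, word2)
  ) %>%
  # sum the existing counts for each unordered pair
  count(first, second, wt = n)

print(merged)
```

With this input, the rows "a c" (n = 3) and "c a" (n = 5) collapse into a single "a c" row with n = 8, and "b a" becomes "a b" with n = 1.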