
Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?


I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.

Code as follows:

word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:

word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)

What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably since there may be multiple pairs that have the same word in the item1 column.

Is this possible using widyr? If not, is there another package that will allow me to do this?

Here is the full code:

#Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

#Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

#Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = c("word" = "word"))

#Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

#Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")

I want to add in pairwise_count, but I need it alongside the phi coefficient.

Thank you!


Solution

  • If you are getting into tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr for the join you are interested in. Joins are not limited to matching on a single column: left_join() accepts a vector of column names in its by argument, so you can match on both item1 and item2 to connect the results of pairwise_cor() and pairwise_count(). You can also pipe straight from one to the other, if you like.

    library(dplyr)
    library(tidytext)
    library(janeaustenr)
    library(widyr)
    
    austen_section_words <- austen_books() %>%
      filter(book == "Pride & Prejudice") %>%
      mutate(section = row_number() %/% 10) %>%
      filter(section > 0) %>%
      unnest_tokens(word, text) %>%
      filter(!word %in% stop_words$word)
    
    austen_section_words %>%
      group_by(word) %>%
      filter(n() >= 20) %>%
      pairwise_cor(word, section, sort = TRUE) %>%
      left_join(austen_section_words %>%
                  pairwise_count(word, section, sort = TRUE),
                by = c("item1", "item2"))
    
    #> # A tibble: 154,842 x 4
    #>        item1     item2 correlation     n
    #>        <chr>     <chr>       <dbl> <dbl>
    #>  1    bourgh        de   0.9508501    29
    #>  2        de    bourgh   0.9508501    29
    #>  3    pounds  thousand   0.7005808    17
    #>  4  thousand    pounds   0.7005808    17
    #>  5   william       sir   0.6644719    31
    #>  6       sir   william   0.6644719    31
    #>  7 catherine      lady   0.6633048    82
    #>  8      lady catherine   0.6633048    82
    #>  9   forster   colonel   0.6220950    27
    #> 10   colonel   forster   0.6220950    27
    #> # ... with 154,832 more rows
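
To see the two-column join in isolation, here is a minimal sketch on a made-up data set. The toy review_words tibble and the word_stats name are assumptions for illustration, not the asker's data. One thing worth noticing: pairwise_cor() returns every pair of words, while pairwise_count() only returns pairs that co-occur at least once, so after a left_join() the pairs that never co-occur have NA in n. The sketch replaces those with 0.

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: one row per (title, word), in the shape that
# unnest_tokens() would produce from a table of reviews.
review_words <- tibble(
  title = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  word  = c("battery", "life", "great",
            "battery", "life",
            "great", "screen",
            "screen", "battery")
)

# Phi coefficient for every pair of words across titles
word_cors <- review_words %>%
  pairwise_cor(word, title, sort = TRUE)

# Number of titles in which each pair of words appears together
word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)

# Match on BOTH item1 and item2, then fill in 0 for pairs that
# never co-occur (they are absent from word_pairs)
word_stats <- word_cors %>%
  left_join(word_pairs, by = c("item1", "item2")) %>%
  mutate(n = coalesce(n, 0))

word_stats
```

With the data from the question, the same pipeline should work with review_words and title as they are, followed by filter(correlation > .15) and write.csv() as before.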