Tags: r, dplyr, tidyr, quanteda, tidytext

Compare feature co-occurrence against significant co-occurrences


I would like to understand the practical differences between the following cases:

  1. Use the function fcm(objectname) to generate a feature co-occurrence matrix of absolute frequencies, then plot it with the function textplot_network().
  2. I read tutorials like tidytextmining and a tutorial written by Andreas Niekler and Gregor Wiedemann, which use the igraph or widyr package. I want to plot correlated word pairs. Inspired by the tidytextmining tutorial, which uses the phi coefficient, I would like to plot this correlation using the lambda coefficient instead.
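To make the difference between the two coefficients concrete, here is a minimal base R sketch. The counts are hypothetical, invented only for illustration. It assumes that, for two-word collocations, the lambda reported by textstat_collocations() behaves like a smoothed log odds ratio (with the default smoothing of 0.5), which is my reading of the documentation rather than a guarantee. Note also that the phi coefficient in tidytextmining is computed from co-occurrence within the same section, while collocations are about adjacent words, so the two measures answer slightly different questions.

```r
# Hypothetical 2x2 counts for a word pair (word1, word2), invented for illustration:
#   n11 = word1 directly followed by word2
#   n10 = word1 followed by some other word
#   n01 = some other word followed by word2
#   n00 = neither word1 first nor word2 second
n11 <- 30
n10 <- 70
n01 <- 20
n00 <- 9880

# Phi coefficient: correlation between two binary indicators, bounded in [-1, 1].
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Smoothed log odds ratio (assumption: this mirrors lambda for bigrams).
# It is unbounded, so strongly associated pairs can reach large values.
smoothing <- 0.5
lambda <- log(n11 + smoothing) + log(n00 + smoothing) -
  log(n10 + smoothing) - log(n01 + smoothing)

print(c(phi = phi, lambda = lambda))
```

Both values are positive when the pair occurs together more often than chance would suggest, but they are on different scales, so thresholds chosen for phi will not transfer directly to lambda.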

I don't know how to plot the correlated word pairs with the quanteda package. My idea (which may not be the most efficient way) is to compute textstat_collocations(), convert the result to a tibble, and plot it with the functions of the widyr package. My open question is: how can I split the column collocation into two separate columns, such as item1 and item2, keep the column lambda alongside them, and assign the result to a tibble object?

> head(sotu_collocations,1)
                collocation count count_nested length   lambda        z
1                smart city   229            0      2 9.846542 51.78172

Solution

  • Like this? Remove the select() command if you prefer to keep all of the columns.

    library("quanteda")
    ## Package version: 2.1.2
    # Note: from quanteda version 3 onwards, textstat_collocations() lives in
    # the separate quanteda.textstats package, which must be loaded as well.
    
    colls <- textstat_collocations(data_corpus_inaugural[1:5], size = 2)
    head(colls)
    ##   collocation count count_nested length   lambda        z
    ## 1      of the    98            0      2 1.494207 11.89704
    ## 2    has been     9            0      2 5.691667 11.61596
    ## 3      i have    15            0      2 3.754144 11.51091
    ## 4      may be    14            0      2 4.072366 11.43632
    ## 5   have been    10            0      2 4.679873 10.94315
    ## 6     we have     9            0      2 4.458284 10.35023
    
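    # Split "collocation" into two word columns and keep only lambda.
    # %>% comes from magrittr; load magrittr or dplyr first if your
    # quanteda version does not re-export it.
    as.data.frame(colls) %>%
      tidyr::separate("collocation", into = c("word1", "word2"), sep = " ") %>%
      dplyr::select(word1, word2, lambda) %>%
      tibble::as_tibble()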
    ## # A tibble: 678 x 3
    ##    word1   word2   lambda
    ##    <chr>   <chr>    <dbl>
    ##  1 of      the       1.49
    ##  2 has     been      5.69
    ##  3 i       have      3.75
    ##  4 may     be        4.07
    ##  5 have    been      4.68
    ##  6 we      have      4.46
    ##  7 foreign nations   6.32
    ##  8 it      is        3.50
    ##  9 my      country   4.49
    ## 10 united  states    7.22
    ## # … with 668 more rows
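
To get from this table to a network plot, one option is to treat each row as an edge weighted by lambda. Below is a minimal sketch using igraph. The data frame copies the ten rows printed above so the example runs without quanteda; the object names (pairs, strong, g) and the cut-off of 4 are arbitrary choices for illustration, not recommendations.

```r
library("igraph")

# The ten word pairs shown in the output above, typed in by hand.
pairs <- data.frame(
  word1  = c("of", "has", "i", "may", "have", "we", "foreign", "it", "my", "united"),
  word2  = c("the", "been", "have", "be", "been", "have", "nations", "is", "country", "states"),
  lambda = c(1.49, 5.69, 3.75, 4.07, 4.68, 4.46, 6.32, 3.50, 4.49, 7.22),
  stringsAsFactors = FALSE
)

# Keep only the more strongly associated pairs (arbitrary threshold).
strong <- pairs[pairs$lambda > 4, ]

# The first two columns become the edge list; lambda becomes an edge attribute.
g <- graph_from_data_frame(strong, directed = TRUE)

# Draw the network, with edge width proportional to lambda.
plot(g, edge.width = E(g)$lambda / 2, edge.arrow.size = 0.4,
     vertex.size = 8, vertex.label.dist = 1.5)
```

With the full tibble you would pass it to graph_from_data_frame() in the same way. The ggraph package used in tidytextmining accepts the same igraph object if you prefer ggplot2-style plots.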