quanteda: Count number of edges for each node in a network plot


I have a network plot computed through the textplot_network() function of the quanteda package. For a minimal example, please refer to the official quanteda website. What I am reporting below is just a copy-paste of what you can find there.

library(quanteda)
load("data/data_corpus_tweets.rda")
tweet_dfm <- dfm(data_corpus_tweets, remove_punct = TRUE)
# keep only the hashtags
tag_dfm <- dfm_select(tweet_dfm, pattern = ("#*"))
toptag <- names(topfeatures(tag_dfm, 50))
# build the feature co-occurrence matrix that fcm_select() needs
tag_fcm <- fcm(tag_dfm)
topgat_fcm <- fcm_select(tag_fcm, pattern = toptag)
textplot_network(topgat_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)

The resulting network plot is the following:

[Image: network plot]

How do I calculate the number of edges for each of the nodes rendered in the plot? If I apply topfeatures() to the fcm object topgat_fcm, I obtain the top hubs of the network, which are the counts of the detected co-occurrences.

Any ideas?

Thanks


Solution

  • The number of edges for any node will be the number of non-zero cells involving that node in the upper triangle, excluding the diagonal (since a feature's co-occurrence with another instance of itself in a document does not produce an "edge" in a plot).

    Let's approach this from a simpler example. I'll define a very simple three-document structure with six word types.

    library("quanteda", warn.conflicts = FALSE)
    ## Package version: 1.4.0
    ## Parallel computing: 2 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    txt <- c("a b b c", "b d d e", "a e f f")
    fcmat <- fcm(txt)
    fcmat
    ## Feature co-occurrence matrix of: 6 by 6 features.
    ## 6 x 6 sparse Matrix of class "fcm"
    ##         features
    ## features a b c d e f
    ##        a 0 2 1 0 1 2
    ##        b 0 1 2 2 1 0
    ##        c 0 0 0 0 0 0
    ##        d 0 0 0 1 2 0
    ##        e 0 0 0 0 0 2
    ##        f 0 0 0 0 0 1
    

    Here, "a" has four edges, with "b", "c", "e", and "f". The row for "b" shows three more edges, with "c", "d", and "e" (excluding "b"'s co-occurrence with itself, in the first document); its edge with "a" is already recorded in the row for "a", since only the upper triangle is stored.

    To get the counts, we can just count the non-zero cells in each row, using rowSums() or, if you transpose the matrix, docfreq(), the equivalent function for computing "document" frequency (although here, the features are the "documents").

    Excluding self-edges, we can verify these edges by looking at the network plot for this fcm.

    rowSums(fcmat > 0)
    ## a b c d e f 
    ## 4 4 0 2 1 1
    docfreq(t(fcmat))
    ## a b c d e f 
    ## 4 4 0 2 1 1
    
    textplot_network(fcmat)
    

    To exclude the self-edge counts, we need to zero the diagonal. Currently, this will drop the class definition on the fcm, which means we will not be able to use it in textplot_network(), but we can still use our rowSums() approach to get the edge counts by node, providing the answer to your question.

    diag(fcmat) <- 0
    rowSums(fcmat > 0)
    ## a b c d e f 
    ## 4 3 0 1 1 0
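
Because the fcm stores each pair only once, in the upper triangle, the row counts attribute every edge to just one of its two endpoints (the edge between "a" and "b" counts for "a" but not for "b"). If what you want is the total number of edges touching each node, one option is to make the matrix symmetric before counting. The sketch below is not taken from the quanteda documentation: it uses a plain base R matrix, under the hypothetical name `m`, filled with the values printed above so that it runs without quanteda. The same arithmetic should carry over to the fcm itself.

```r
# Base R stand-in for the upper-triangular fcm printed above.
# `feats` and `m` are hypothetical names; the values are copied from that output.
feats <- c("a", "b", "c", "d", "e", "f")
m <- matrix(
  c(0, 2, 1, 0, 1, 2,
    0, 1, 2, 2, 1, 0,
    0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 2, 0,
    0, 0, 0, 0, 0, 2,
    0, 0, 0, 0, 0, 1),
  nrow = 6, byrow = TRUE, dimnames = list(feats, feats)
)

diag(m) <- 0            # drop self co-occurrences, which are not edges
adj <- (m + t(m)) > 0   # symmetric adjacency: TRUE where two features co-occur
degree <- rowSums(adj)  # number of edges touching each node
degree
## a b c d e f 
## 4 4 2 2 4 2
```

Each of the nine edges is counted once for each endpoint, so the degrees sum to 18. For example, "c" now gets its edges with "a" and "b", which the row-only count reports as 0.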