Why are results different for a column vs. a row of a Quanteda feature co-occurrence matrix?


I am trying to use Quanteda to count the number of times different terms co-occur with a specific term (e.g., Vietnam or "越南") within a given quarter.

But when I select either the column or the row for that term from the feature co-occurrence matrix (FCM), the counts are different.

Could anybody tell me why this is or what I'm doing wrong? I'm worried my analysis based on these results is incorrect.

##Producing the FCM

> corp <- corpus(data_SCS14q4)
> toks <- tokens(corp, remove_punct = TRUE) %>%  tokens_remove(ch_stop) %>% tokens_compound(phrase("东 盟"), concatenator = "") 
> fcm_14q4 <- fcm(toks, context = "window")

##Taking the row for Vietnam or "越南":

> mt <- fcm_14q4["越南",]
> head(mt)

Feature co-occurrence matrix of: 1 by 6 features.
        features
features 印 司令 中国 2050 收复 台湾
    越南  0    0    0    0    0    0

##Taking the column for Vietnam or "越南":

> mt2 <- fcm_14q4[,"越南"]
> head(mt2)

Feature co-occurrence matrix of: 6 by 1 feature.
        features
features 越南
    印      0
    司令    0
    中国   68
    2050    0
    收复    8
    台湾    4


Solution

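  • This happens because, by default (tri = TRUE), fcm() returns only the upper triangle of the co-occurrence matrix, which is symmetric when ordered = FALSE. Each pair's count is therefore stored in only one of the two possible cells, so the row slice and the column slice for the same feature each see a different part of the counts. To make the two index slices equivalent, specify tri = FALSE.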

    library("quanteda")
    ## Package version: 3.1
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    toks <- tokens(c("a a a b b c", "a a c e", "a c e f g"))
    
    # default is only upper triangle
    fcm(toks, context = "window", window = 2, tri = TRUE)
    ## Feature co-occurrence matrix of: 6 by 6 features.
    ##         features
    ## features a b c e f g
    ##        a 8 3 3 2 0 0
    ##        b 0 2 2 0 0 0
    ##        c 0 0 0 2 1 0
    ##        e 0 0 0 0 1 1
    ##        f 0 0 0 0 0 1
    ##        g 0 0 0 0 0 0
    

    Setting tri = FALSE returns the full symmetric matrix, in which case the row slice and the column slice give the same counts:

    fcmat2 <- fcm(toks, context = "window", window = 2, tri = FALSE)
    fcmat2
    ## Feature co-occurrence matrix of: 6 by 6 features.
    ##         features
    ## features a b c e f g
    ##        a 8 3 3 2 0 0
    ##        b 3 2 2 0 0 0
    ##        c 3 2 0 2 1 0
    ##        e 2 0 2 0 1 1
    ##        f 0 0 1 1 0 1
    ##        g 0 0 0 1 1 0
    
    fcmat2[, "a"]
    ## Feature co-occurrence matrix of: 6 by 1 features.
    ##         features
    ## features a
    ##        a 8
    ##        b 3
    ##        c 3
    ##        e 2
    ##        f 0
    ##        g 0
    t(fcmat2["a", ])
    ## Feature co-occurrence matrix of: 6 by 1 features.
    ##         features
    ## features a
    ##        a 8
    ##        b 3
    ##        c 3
    ##        e 2
    ##        f 0
    ##        g 0
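
If the FCM has already been built with the default tri = TRUE, the counts for one feature can also be recovered by combining its row and its column. The following is a minimal sketch of that idea, reusing the toy tokens from the answer. The target feature "c" and the variable names are chosen only for illustration, and the sketch assumes ordered = FALSE so that the matrix is symmetric.

```r
library("quanteda")

# Toy tokens reused from the answer above
toks <- tokens(c("a a a b b c", "a a c e", "a c e f g"))

# Default behaviour: only the upper triangle is filled (tri = TRUE)
fcmat_tri <- as.matrix(fcm(toks, context = "window", window = 2, tri = TRUE))

# Full symmetric matrix (tri = FALSE)
fcmat_full <- as.matrix(fcm(toks, context = "window", window = 2, tri = FALSE))

target <- "c"  # hypothetical target feature, chosen only for illustration

# With tri = TRUE, counts for the target are split between its row and its column
row_counts <- fcmat_tri[target, ]
col_counts <- fcmat_tri[, target]

# Add the two slices, then subtract the diagonal cell, which appears in both
combined <- row_counts + col_counts
combined[target] <- combined[target] - fcmat_tri[target, target]

# The combined vector should match the target's row in the full symmetric matrix
full_row <- fcmat_full[target, ]
all.equal(combined, full_row)

# Co-occurring features for the target, most frequent first
sort(full_row, decreasing = TRUE)
```

For the question's data, rebuilding the matrix with fcm(toks, context = "window", tri = FALSE) is probably the simpler route, since either fcm_14q4["越南", ] or fcm_14q4[, "越南"] would then hold the complete counts.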