Why does my quanteda fcm() show zero frequencies?


Here I apply quanteda's fcm() function to data_corpus_inaugural:

library(tidyverse)  # attaches dplyr, which provides the %>% pipe
library(quanteda)

data_inaug <- data_corpus_inaugural

fcm_inaug <-  data_inaug %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) %>% 
  tokens_tolower() %>%
  fcm(context = "window", window = 5, count = "frequency")

I then select a subset from the fcm:

fcm_inaug[c("as", "for", "since", "because"), names(topfeatures(fcm_inaug))]

Now, lots of frequencies are zero:

Feature co-occurrence matrix of: 4 by 10 features.
         features
features  the our  we in and to is a  be for
  as        0 153 153  0   0  0 92 0 145  93
  for       0 210 165  0   0  0  0 0   0 110
  since     0   0   5  0   0  0  0 0   0   0
  because   0   0   0  0   0  0  0 0   0   0

However, many of the words with a count of 0 do in fact occur within the five-word window around these conjunctions. The word "and", for instance, occurs a total of fourteen times around "since":

is noted to prove that  <<since>>   truth ***and*** reason have maintained
rather more than forty-four years   <<since>>   we declared our independence ***and***
maxim of our policy ever    <<since>>   the days of washington ***and***
exercise of our national sovereignty    <<since>>   freedom impelled ***and*** independence inspired
of public prosperity ***and*** felicity <<since>>   we ought to be no
heaven itself has ordained ***and***    <<since>>   the preservation of the sacred
***and*** ***and*** it has been <<since>>   the constant effort of the
declared our independence ***and*** thirty-seven    <<since>>   it was acknowledged the talents
our national infancy ***and*** has  <<since>>   upheld our liberties in various
than force of arms ***and***    <<since>>   it presents to the world
in the spanish war ***and***    <<since>>   has given it a position
one hundred ***and*** fiftieth year <<since>>   our national consciousness first asserted
world recovery ***and*** lasting peace  <<since>>   the end of hostilities the
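
A concordance like the one above can be produced with kwic(). The following is a minimal sketch, assuming the same preprocessing as in the fcm() call. The names toks, kw, context_words and n_and are introduced here for illustration, and the resulting count depends on the version of data_corpus_inaugural that ships with the installed quanteda.

```r
library(quanteda)

# Same preprocessing as in the fcm() call (assumed), kept as a tokens object
toks <- tokens(data_corpus_inaugural,
               remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE)
toks <- tokens_tolower(toks)

# Keyword-in-context: five tokens on either side of "since"
kw <- kwic(toks, pattern = "since", window = 5)

# Split the left and right contexts back into words and count "and"
context_words <- unlist(strsplit(c(kw$pre, kw$post), " ", fixed = TRUE))
n_and <- sum(context_words == "and")
n_and
```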


Solution

  • This happens because you are slicing the fcm with the index operator; select features from it with fcm_select() instead.

    Try this (here I have not used topfeatures() on the fcm object, since that is removed in the forthcoming 4.0 release):

    library(quanteda)
    #> Package version: 4.0.0
    #> Unicode version: 14.0
    #> ICU version: 71.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    data_inaug <- data_corpus_inaugural
    
    fcm_inaug <-  data_inaug |>
      tokens(remove_punct = TRUE,
             remove_symbols = TRUE,
             remove_numbers = TRUE,
             remove_url = TRUE) |> 
      tokens_tolower() |>
      fcm(context = "window", window = 5, count = "frequency")
    
    topfeats <- rowSums(fcm_inaug) |>
      sort(decreasing = TRUE) |>
      head(10)
    
    fcm_select(fcm_inaug, c("as", "for", "since", "because", names(topfeats)))
    #> Feature co-occurrence matrix of: 14 by 14 features.
    #>         features
    #> features   of  the  and   to that    a   in  as  it  be
    #>     of   2134 8617 2923 1759  714 1252 1461 382 481 488
    #>     the     0 6212 3978 3309 1183 1081 2266 581 814 777
    #>     and     0    0 1354 1632  527  755 1029 329 392 515
    #>     to      0    0    0 1076  544  785  708 326 601 699
    #>     that    0    0    0    0  128  278  302  92 268 281
    #>     a       0    0    0    0    0  356  492 221 221 232
    #>     in      0    0    0    0    0    0  490 161 274 236
    #>     as      0    0    0    0    0    0    0 294 143 145
    #>     it      0    0    0    0    0    0    0   0 170 329
    #>     be      0    0    0    0    0    0    0   0   0 100
    #> [ reached max_feat ... 4 more features, reached max_nfeat ... 4 more features ]
    

    Created on 2024-02-19 with reprex v2.1.0
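
    The zeros below the diagonal in the output above are consistent with fcm()'s default tri = TRUE, which returns only the upper triangle of the matrix, so the count for each pair of features appears in only one of its two cells. The following is a minimal sketch on a made-up two-document example (the texts and the names toks, m_tri and m_full are illustrative, not from the original post) showing that tri = FALSE returns the full symmetric matrix, so a lookup works in either direction.

```r
library(quanteda)

# Toy texts (hypothetical) in which "and" and "since" co-occur
toks <- tokens(c(d1 = "peace and order since then",
                 d2 = "since war and peace"))

# Default: tri = TRUE keeps only the upper triangle
m_tri <- as.matrix(fcm(toks, context = "window", window = 5))

# tri = FALSE keeps both triangles
m_full <- as.matrix(fcm(toks, context = "window", window = 5, tri = FALSE))

# With tri = TRUE, one of the two cells for a pair is zero
m_tri["since", "and"]
m_tri["and", "since"]

# With tri = FALSE, both cells hold the same count
m_full["since", "and"]
m_full["and", "since"]
```

    On the full corpus, the same argument could be passed to the fcm() call from the question, at the cost of storing roughly twice as many non-zero cells.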