Why does my quanteda fcm() show zero frequencies?

Here I apply quanteda's fcm() function to data_corpus_inaugural:

library(dplyr)
library(tidyverse)
library(quanteda)

data_inaug <- data_corpus_inaugural

fcm_inaug <-  data_inaug %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) %>% 
  tokens_tolower() %>%
  fcm(context = "window", window = 5, count = "frequency")

I then select a subset from the fcm:

fcm_inaug[c("as", "for", "since", "because"), names(topfeatures(fcm_inaug))]

Now, lots of frequencies are zero:

Feature co-occurrence matrix of: 4 by 10 features.
         features
features  the our  we in and to is a  be for
  as        0 153 153  0   0  0 92 0 145  93
  for       0 210 165  0   0  0  0 0   0 110
  since     0   0   5  0   0  0  0 0   0   0
  because   0   0   0  0   0  0  0 0   0   0

However, many of the words with a frequency of 0 do in fact occur within the five-word window around the conjunctions. The word "and", for instance, occurs a total of fourteen times around "since":

is noted to prove that  <<since>>   truth ***and*** reason have maintained
rather more than forty-four years   <<since>>   we declared our independence ***and***
maxim of our policy ever    <<since>>   the days of washington ***and***
exercise of our national sovereignty    <<since>>   freedom impelled ***and*** independence inspired
of public prosperity ***and*** felicity <<since>>   we ought to be no
heaven itself has ordained ***and***    <<since>>   the preservation of the sacred
***and*** ***and*** it has been <<since>>   the constant effort of the
declared our independence ***and*** thirty-seven    <<since>>   it was acknowledged the talents
our national infancy ***and*** has  <<since>>   upheld our liberties in various
than force of arms ***and***    <<since>>   it presents to the world
in the spanish war ***and***    <<since>>   has given it a position
one hundred ***and*** fiftieth year <<since>>   our national consciousness first asserted
world recovery ***and*** lasting peace  <<since>>   the end of hostilities the
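
Concordance lines like these can be reproduced with quanteda's kwic(). The following is a minimal sketch, not necessarily how the listing above was generated; the object names (toks, kw_df, ctx, n_and) are made up for illustration, and the regular-expression count of "and" is only a rough cross-check.

```r
library(quanteda)

# Same preprocessing as in the fcm() pipeline
toks <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) |>
  tokens_tolower()

# Keyword-in-context for "since", five tokens on either side
kw_df <- as.data.frame(kwic(toks, pattern = "since", window = 5))

# Count occurrences of the whole word "and" in the left and right contexts
ctx <- paste(kw_df$pre, kw_df$post)
n_and <- sum(lengths(regmatches(ctx, gregexpr("\\band\\b", ctx))))
n_and
```

Any nonzero count here would confirm that the co-occurrence exists in the data even though the sliced fcm shows 0.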

Solution

The zeros appear because you are slicing the fcm with the index operator. Select features from the fcm with fcm_select() instead.

Try this (here I've not used topfeatures() on the fcm object, since that is removed in the forthcoming 4.0 release):

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

data_inaug <- data_corpus_inaugural

fcm_inaug <-  data_inaug |>
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) |> 
  tokens_tolower() |>
  fcm(context = "window", window = 5, count = "frequency")

topfeats <- rowSums(fcm_inaug) |>
  sort(decreasing = TRUE) |>
  head(10)

fcm_select(fcm_inaug, c("as", "for", "since", "because", names(topfeats)))
#> Feature co-occurrence matrix of: 14 by 14 features.
#>         features
#> features   of  the  and   to that    a   in  as  it  be
#>     of   2134 8617 2923 1759  714 1252 1461 382 481 488
#>     the     0 6212 3978 3309 1183 1081 2266 581 814 777
#>     and     0    0 1354 1632  527  755 1029 329 392 515
#>     to      0    0    0 1076  544  785  708 326 601 699
#>     that    0    0    0    0  128  278  302  92 268 281
#>     a       0    0    0    0    0  356  492 221 221 232
#>     in      0    0    0    0    0    0  490 161 274 236
#>     as      0    0    0    0    0    0    0 294 143 145
#>     it      0    0    0    0    0    0    0   0 170 329
#>     be      0    0    0    0    0    0    0   0   0 100
#> [ reached max_feat ... 4 more features, reached max_nfeat ... 4 more features ]

Created on 2024-02-19 with reprex v2.1.0
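
A possible explanation for the zeros, offered as a sketch rather than a definitive diagnosis: by default fcm() is called with tri = TRUE, so only the upper triangle of the symmetric co-occurrence matrix is stored. A cell such as [since, and] is then empty whenever "and" precedes "since" in the feature order, and the count sits in [and, since] instead. The toy example below (the two documents and the object names are made up for illustration) compares the default with tri = FALSE, which fills in both triangles.

```r
library(quanteda)

# Hypothetical toy data; features are expected in order of first appearance
toks_toy <- tokens(c(d1 = "a b c", d2 = "b c d"))

# Default: tri = TRUE keeps only the upper triangle
m_tri <- as.matrix(fcm(toks_toy, context = "window", window = 5))

# tri = FALSE returns the full symmetric matrix
m_full <- as.matrix(fcm(toks_toy, context = "window", window = 5, tri = FALSE))

m_tri["c", "a"]   # lower-triangle cell: zero under the default
m_tri["a", "c"]   # the same pair, stored in the upper triangle
m_full["c", "a"]  # with tri = FALSE the count appears in both cells
```

With the inaugural corpus, adding tri = FALSE to the fcm() call should let a lookup in either direction return the count, at the cost of a larger object.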