Here I apply quanteda's fcm() function to data_corpus_inaugural:
library(dplyr)
library(tidyverse)
library(quanteda)
data_inaug <- data_corpus_inaugural
fcm_inaug <- data_inaug %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) %>%
  tokens_tolower() %>%
  fcm(context = "window", window = 5, count = "frequency")
I then select a subset from the fcm:
fcm_inaug[c("as", "for", "since", "because"), names(topfeatures(fcm_inaug))]
Now, lots of frequencies are zero:
Feature co-occurrence matrix of: 4 by 10 features.
         features
features  the our  we in and to is a  be for
  as        0 153 153  0   0  0 92 0 145  93
  for       0 210 165  0   0  0  0 0   0 110
  since     0   0   5  0   0  0  0 0   0   0
  because   0   0   0  0   0  0  0 0   0   0
However, many of the words with frequency 0 do in fact occur within the five-word window around the conjunctions. The word "and", for instance, occurs a total of fourteen times around "since":
is noted to prove that <<since>> truth ***and*** reason have maintained
rather more than forty-four years <<since>> we declared our independence ***and***
maxim of our policy ever <<since>> the days of washington ***and***
exercise of our national sovereignty <<since>> freedom impelled ***and*** independence inspired
of public prosperity ***and*** felicity <<since>> we ought to be no
heaven itself has ordained ***and*** <<since>> the preservation of the sacred
***and*** ***and*** it has been <<since>> the constant effort of the
declared our independence ***and*** thirty-seven <<since>> it was acknowledged the talents
our national infancy ***and*** has <<since>> upheld our liberties in various
than force of arms ***and*** <<since>> it presents to the world
in the spanish war ***and*** <<since>> has given it a position
one hundred ***and*** fiftieth year <<since>> our national consciousness first asserted
world recovery ***and*** lasting peace <<since>> the end of hostilities the
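For reference, a context count like the one above could be checked roughly as follows. This is only a sketch on a single toy sentence rather than the full corpus; it assumes that kwic() with window = 5 is a fair stand-in for the fcm() window, and the names toy_toks, ctx_words and n_and are illustrative.

```r
library(quanteda)

# Toy text standing in for the inaugural corpus (illustrative only)
toy_toks <- tokens_tolower(
  tokens("heaven itself has ordained and since the preservation of the sacred fire")
)

# Keyword-in-context rows for "since", five tokens on each side
kw <- as.data.frame(kwic(toy_toks, pattern = "since", window = 5))

# Split the left and right contexts into single words and count "and"
ctx_words <- unlist(strsplit(c(kw$pre, kw$post), " ", fixed = TRUE))
n_and <- sum(ctx_words == "and")
n_and
```

Running the same steps on the tokens built from data_corpus_inaugural should give the total number of times "and" falls inside the window around "since".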
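That's because you should select features from the fcm with fcm_select() rather than slicing it with the index operator. By default, fcm() returns only the upper triangle (tri = TRUE), so each pair's count is stored in just one of its two cells, and a rectangular slice can land on the empty side.
Try this (here I've not used topfeatures() on the fcm object, since that's removed in the forthcoming 4.0 release):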
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
data_inaug <- data_corpus_inaugural
fcm_inaug <- data_inaug |>
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE) |>
  tokens_tolower() |>
  fcm(context = "window", window = 5, count = "frequency")
topfeats <- rowSums(fcm_inaug) |>
  sort(decreasing = TRUE) |>
  head(10)
fcm_select(fcm_inaug, c("as", "for", "since", "because", names(topfeats)))
#> Feature co-occurrence matrix of: 14 by 14 features.
#>         features
#> features   of  the  and   to that    a   in  as  it  be
#>   of     2134 8617 2923 1759  714 1252 1461 382 481 488
#>   the       0 6212 3978 3309 1183 1081 2266 581 814 777
#>   and       0    0 1354 1632  527  755 1029 329 392 515
#>   to        0    0    0 1076  544  785  708 326 601 699
#>   that      0    0    0    0  128  278  302  92 268 281
#>   a         0    0    0    0    0  356  492 221 221 232
#>   in        0    0    0    0    0    0  490 161 274 236
#>   as        0    0    0    0    0    0    0 294 143 145
#>   it        0    0    0    0    0    0    0   0 170 329
#>   be        0    0    0    0    0    0    0   0   0 100
#> [ reached max_feat ... 4 more features, reached max_nfeat ... 4 more features ]
Created on 2024-02-19 with reprex v2.1.0
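As a side note, here is a minimal sketch of why one orientation of a word pair can show 0 in an fcm. It uses a three-word toy text rather than the inaugural corpus, and the names toy_toks, m_tri and m_full are illustrative. It assumes as.matrix() converts an fcm to a base matrix with feature names as dimnames.

```r
library(quanteda)

# Toy sentence (illustrative only): each pair co-occurs exactly once
toy_toks <- tokens("and we since")

# Default tri = TRUE: only the upper triangle is kept
m_tri <- as.matrix(fcm(toy_toks, context = "window", window = 5))

# tri = FALSE: the full symmetric matrix is returned
m_full <- as.matrix(fcm(toy_toks, context = "window", window = 5, tri = FALSE))

# With tri = TRUE the pair's count sits in only one of the two cells
c(m_tri["since", "and"], m_tri["and", "since"])

# With tri = FALSE both orientations carry the count
c(m_full["since", "and"], m_full["and", "since"])
```

If you need to look up a pair by either row or column, passing tri = FALSE to fcm() is one option; otherwise check both orientations of the pair in the triangular matrix.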