I'm struggling to find the correct way to count types (unique word forms) in a quanteda corpus. ntype() gives the number of types per document, but not for the corpus as a whole.
I found two ways to get this count, but they give different results and I don’t understand why.
Reproducible code:
library(quanteda)
corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)
nfeat(dfm(corp_uk_tokens))
length(types(corp_uk_tokens))
nfeat(dfm(corp_uk_tokens)) outputs 1648
length(types(corp_uk_tokens)) outputs 1804
Which one is correct, and why don’t these two calculations give the same result?
Thanks a lot for helping!
It's because dfm() has tolower = TRUE as a default, so the features counted by nfeat() have already had some types merged by lowercasing. Both numbers are correct; one is case-sensitive and the other is not. If you turn lowercasing off, you will get the same result as the length of types().
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)
# length of types vector
length(types(corp_uk_tokens))
#> [1] 1800
# gives the types after lowercasing, default for dfm()
nfeat(dfm(corp_uk_tokens))
#> [1] 1644
# without lowercasing, it's the same
nfeat(dfm(corp_uk_tokens, tolower = FALSE))
#> [1] 1800
Created on 2024-03-28 with reprex v2.1.0
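To make the merging effect concrete, here is a minimal base-R sketch. It does not use quanteda, and the vector `toy_types` is a hypothetical example, not taken from data_char_ukimmig2010. It only illustrates why counting unique forms before and after lowercasing can give different totals.

```r
# Hypothetical toy vector of types (not taken from data_char_ukimmig2010)
toy_types <- c("The", "the", "Immigration", "immigration", "UK", "BNP")

# Case-sensitive count: every distinct spelling is its own type
n_case_sensitive <- length(unique(toy_types))

# Case-insensitive count: lowercase first, then deduplicate
lowered <- tolower(toy_types)
n_lowercased <- length(unique(lowered))

# Types that share a lowercase form with at least one other type
merged <- toy_types[lowered %in% lowered[duplicated(lowered)]]

print(n_case_sensitive)
print(n_lowercased)
print(merged)
```

The same idea should carry over to the real corpus by replacing `toy_types` with types(corp_uk_tokens), although base R's tolower() and the lowercasing inside dfm() may not agree on every Unicode edge case, so treat the result as an approximation.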