
Correct way to count types in whole corpus


I am struggling to find the correct way to count types (unique word forms) in a quanteda corpus. ntype() gives the number of types per document, but not for the corpus as a whole.

I found two ways to get this count, but they give different results and I don’t understand why.

Reproducible code:

library(quanteda)

corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)

nfeat(dfm(corp_uk_tokens))
length(types(corp_uk_tokens))

nfeat(dfm(corp_uk_tokens)) outputs 1648

length(types(corp_uk_tokens)) outputs 1804

Which one is correct, and why don’t these two calculations give the same result?

Thanks a lot for helping!


Solution

  • This happens because dfm() lowercases by default (tolower = TRUE), so types that differ only in case are merged into a single feature and nfeat() returns a smaller count. If you turn lowercasing off, nfeat() gives the same result as length(types()).

    library(quanteda)
    #> Package version: 4.0.0
    #> Unicode version: 14.0
    #> ICU version: 71.1
    #> Parallel computing: disabled
    #> See https://quanteda.io for tutorials and examples.
    
    corp_uk <- corpus(data_char_ukimmig2010)
    corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)
    
    # length of types vector
    length(types(corp_uk_tokens))
    #> [1] 1800
    
    # gives the types after lowercasing, default for dfm()
    nfeat(dfm(corp_uk_tokens))
    #> [1] 1644
    
    # without lowercasing, it's the same
    nfeat(dfm(corp_uk_tokens, tolower = FALSE))
    #> [1] 1800
    

    Created on 2024-03-28 with reprex v2.1.0
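
As a complementary sketch (not part of the original answer), the same effect can be reproduced on a tiny made-up example where the counts are easy to check by hand. The two toy documents and the variable names are hypothetical; tokens_tolower() is used on the assumption that lowercasing the tokens first mirrors what dfm() does by default.

```r
library(quanteda)

# Two made-up documents (hypothetical data, not from data_char_ukimmig2010)
toks <- tokens(c(d1 = "The cat saw the Cat", d2 = "A dog saw THE cat"))

# Case-sensitive types across the whole tokens object:
# The, cat, saw, the, Cat, A, dog, THE
n_case_sensitive <- length(types(toks))

# Lowercase the tokens first, then count types: the, cat, saw, a, dog
n_lowercased <- length(types(tokens_tolower(toks)))

# dfm() lowercases by default, so its feature count should match n_lowercased
n_dfm_default <- nfeat(dfm(toks))

# With lowercasing turned off, it should match n_case_sensitive
n_dfm_cased <- nfeat(dfm(toks, tolower = FALSE))

c(n_case_sensitive, n_lowercased, n_dfm_default, n_dfm_cased)
```

Neither count is wrong as such: which one to report depends on whether forms such as "The" and "the" should count as the same type in your analysis.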