quanteda: what does the Types column returned by summary(corpus) mean?


I am studying the quanteda package for R and could not find in the documentation what the Types column returned by summary(immig_corp) means.

require(quanteda)
require(readtext)

Now I create the corpus:

immig_corp <- corpus(data_char_ukimmig2010, 
                 docvars = data.frame(party = names(data_char_ukimmig2010)))

Now I would like to display some information about the corpus I have just created. Types is one of the columns that summary(corpus) always reports.

summary(immig_corp)

This returns the following:

Corpus consisting of 9 documents:

        Text Types Tokens Sentences        party
         BNP  1125   3280        88          BNP
   Coalition   142    260         4    Coalition
Conservative   251    499        15 Conservative
      Greens   322    679        21       Greens
      Labour   298    683        29       Labour
      LibDem   251    483        14       LibDem
          PC    77    114         5           PC
         SNP    88    134         4          SNP
        UKIP   346    723        27         UKIP

Solution

  • Let's concentrate on immig_corp <- corpus(data_char_ukimmig2010). Calling summary(immig_corp) on it returns the following:

    Corpus consisting of 9 documents:
    
             Text Types Tokens Sentences
              BNP  1125   3280        88
        Coalition   142    260         4
     Conservative   251    499        15
           Greens   322    679        21
           Labour   298    683        29
           LibDem   251    483        14
               PC    77    114         5
              SNP    88    134         4
             UKIP   346    723        27
    

    Text is the document name, Sentences is the number of sentences in the document, Tokens is the total number of tokens in the text, and Types is the number of unique tokens. So the BNP document has 1125 unique tokens (types), 3280 tokens in total, and 88 sentences.
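
    To make the distinction concrete, here is a minimal sketch on a made-up sentence. The text, the document name d1, and the variable names are illustrative and not part of the corpus above. It tokenizes first and then counts, since ntoken() and ntype() both accept a tokens object:

    ```r
    library(quanteda)

    # Hypothetical one-document example: all lowercase and no punctuation,
    # so the counts do not depend on case or punctuation handling.
    toy <- c(d1 = "the cat sat on the mat")
    toy_toks <- tokens(toy)

    n_tokens <- ntoken(toy_toks)  # every token counts: "the" appears twice
    n_types  <- ntype(toy_toks)   # distinct tokens only: "the" counts once

    # Cross-check with base R on the underlying character vector
    toy_words <- as.list(toy_toks)[["d1"]]
    length(toy_words)          # 6
    length(unique(toy_words))  # 5
    ```

    The sentence has six tokens but only five types, because the repeated "the" is counted once.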

    You can recreate the counts as follows:

    # Sentences
    nsentence(immig_corp)
             BNP    Coalition Conservative       Greens       Labour       LibDem           PC          SNP         UKIP 
              88            4           15           21           29           14            5            4           27 
    
    # Tokens
    ntoken(immig_corp)
             BNP    Coalition Conservative       Greens       Labour       LibDem           PC          SNP         UKIP 
            3280          260          499          679          683          483          114          134          723 
    
    # Types
    ntype(immig_corp)
             BNP    Coalition Conservative       Greens       Labour       LibDem           PC          SNP         UKIP 
            1125          142          251          322          298          251           77           88          346
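
    The three counts can also be collected into one table that mirrors the summary() output. This is only a sketch, not how summary() is implemented internally: it tokenizes explicitly and counts sentences via tokens(..., what = "sentence"), and the exact numbers may differ from those above depending on the installed quanteda version. The names immig_toks, immig_sents and immig_counts are illustrative.

    ```r
    library(quanteda)

    immig_corp <- corpus(data_char_ukimmig2010)

    immig_toks  <- tokens(immig_corp)                     # word-level tokens
    immig_sents <- tokens(immig_corp, what = "sentence")  # one "token" per sentence

    # One row per document, with the same columns summary() reports
    immig_counts <- data.frame(
      Text      = docnames(immig_corp),
      Types     = unname(ntype(immig_toks)),    # unique tokens per document
      Tokens    = unname(ntoken(immig_toks)),   # total tokens per document
      Sentences = unname(ntoken(immig_sents))   # sentences per document
    )
    immig_counts
    ```

    Whatever the version, Types can never exceed Tokens for a document, which is a quick sanity check on the result.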