
Count number of tokens per year


I wrote a small R script. The input is text files (thousands of journal articles), and I generate the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.

library(readtext)
library(quanteda)

# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep = "_", 
                          docvarnames = c("Unit", "Year", "Volume", "Issue")) 
# we add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)

Does anyone here know a solution?

I tried the tokens_by function of the quanteda package, but it seems to be outdated.


Solution

  • You do not need the substr(rawdata_SPARA$Year, 0, 4) step. When readtext is called with docvarsfrom = "filenames", it extracts the year from the file name. In the example below the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically stored as a document variable in the readtext object. Since that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.

    Then simply group_by by year and summarize the token counts. The number of tokens per document is calculated with quanteda's ntoken function.

    See the code below:

    library(readtext)
    library(quanteda)
    library(dplyr)
    
    # Prepare sample corpus
    set.seed(123)
    DATA_DIR <- system.file("extdata/", package = "readtext")
    rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                   docvarsfrom = "filenames",
                   docvarnames = c("unit", "context", "year", "language", "party"),
                   encoding = "LATIN1")
    # Assign random years for demonstration purposes
    rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)
    
    # Calculate tokens per document, ignoring punctuation
    rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)
    
    # Find distribution by year
    rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
    

    Output:

    # A tibble: 3 × 2
       year total_tokens
      <int>        <int>
    1  2005         5681
    2  2006        26564
    3  2007        24119
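
    Applied to your own data, a minimal sketch could look like the block below. It assumes the Year document variable created by your readtext call above (the column and object names are taken from your question, not verified against your files):

    library(dplyr)
    
    # Token count per document, punctuation removed
    rawdata_SPARA$tokens <- ntoken(corpus(rawdata_SPARA), remove_punct = TRUE)
    
    # Total number of tokens per publication year
    rawdata_SPARA %>%
      group_by(Year) %>%
      summarize(total_tokens = sum(tokens))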