I wrote a small R script. The input is text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.
library(readtext)
library(quanteda)

# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt",
                          docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Build the corpus
SPARA_corp <- corpus(rawdata_SPARA)
Does anyone here know a solution? I tried the tokens_by function of the quanteda package, but it seems to be outdated.
You do not need the substr(rawdata_SPARA$Year, 0, 4) call. When the readtext function extracts docvars from the file names, the year is already parsed out: in the example below the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object. Since that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
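For instance, you can verify the extracted metadata directly (a minimal check, assuming the rawdata_SPARA object from your question):

# The docvars parsed from the file names are ordinary data.frame columns
str(rawdata_SPARA)          # shows doc_id, text, Unit, Year, Volume, Issue
table(rawdata_SPARA$Year)   # number of documents per year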
Then just group_by year and summarize the token counts. The number of tokens per document is calculated with quanteda's ntoken function. See the code below:
library(readtext)
library(quanteda)
library(dplyr)

# Prepare a sample corpus from the files shipped with readtext
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
               docvarsfrom = "filenames",
               docvarnames = c("unit", "context", "year", "language", "party"),
               encoding = "LATIN1")
# Randomize the year docvar just to get a spread of years for the demonstration
rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)
# Count tokens per document (punctuation removed)
rt$tokens <- ntoken(tokens(corpus(rt), remove_punct = TRUE))
# Find the distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
Output:
# A tibble: 3 × 2
year total_tokens
<int> <int>
1 2005 5681
2 2006 26564
3 2007 24119
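Applied to your own data, the same pattern would look like this (a sketch, assuming the rawdata_SPARA and SPARA_corp objects from your script):

# Tokens per document, then totals per Year
rawdata_SPARA$tokens <- ntoken(tokens(SPARA_corp, remove_punct = TRUE))
rawdata_SPARA %>% group_by(Year) %>% summarize(total_tokens = sum(tokens))

If you prefer to stay entirely inside quanteda, grouping a document-feature matrix with dfm_group(groups = Year) and calling ntoken() on the result should give the same totals, depending on your quanteda version.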