Tags: r, nlp, time-series, quanteda

Performing time series analysis of quanteda tokens


I am having trouble finding a way to pair time information with each individual token in quanteda. I want to run time series analysis on a list of 25 different tokens. I know that I can find the index of each token, but I was wondering whether there is a way to attach date information directly to each token.


Solution

  • As far as I understand your question, you want to keep the date information alongside each text for time series analysis. Here are a few hints:

    creating the corpus

    First we create a corpus. Since you haven't supplied example data, I'll just use some random text created with the stringi package:

    library(quanteda)
    set.seed(1)
    text <- stringi::stri_rand_lipsum(nparagraphs = 30)
    length(text)
    #> [1] 30
    

    I create a vector of random dates to go along with that:

    date <- sample(seq(as.Date("1999/01/01"), as.Date("1999/02/01"), by = "day"), 30)
    

    Now we can create the corpus object. If you check the help of the corpus function (?corpus) you can see that there are different methods for different input objects. For character objects, we can supply additional document-level variables as a data.frame:

    corp <- corpus(x = text, 
                   docnames = NULL, 
                   docvars = data.frame(date = date))
    corp
    #> Corpus consisting of 30 documents and 1 docvar.
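
    If you do want the date next to every single token rather than next to every document, one option is to expand the tokens into a long data.frame. This is only a sketch on a made-up two-document corpus (`corp_toy`, `toks` and `tok_df` are names I chose for illustration), and it relies on `tokens()` carrying over the docvars of the corpus:

    ```r
    library(quanteda)

    # Toy corpus with a date docvar (made-up data, for illustration only)
    corp_toy <- corpus(
      c(d1 = "sed in sed", d2 = "in vitae sed"),
      docvars = data.frame(date = as.Date(c("1999-01-02", "1999-01-01")))
    )
    toks <- tokens(corp_toy)

    # Repeat each document's name and date once per token in that document
    tok_df <- data.frame(
      doc_id = rep(docnames(toks), times = ntoken(toks)),
      token  = unlist(as.list(toks), use.names = FALSE),
      date   = rep(docvars(toks, "date"), times = ntoken(toks))
    )
    tok_df
    ```

    Each row of `tok_df` is one token together with the date of its document, so it can be filtered down to the 25 tokens of interest and counted by date.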
    

    creating and subsetting dfm

    Most analysis in quanteda is done with the help of document-feature matrix objects. Here we convert our corpus to a dfm and then only keep the features we want to analyse. In this case I just picked the most common words in the random text:

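    # Newer quanteda versions (3.0 onward) deprecate calling dfm() directly
    # on a corpus, so tokenise first
    dfm <- dfm(tokens(corp))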
    
    dfm_sub <- dfm_keep(dfm, 
                        pattern = c("sed", "in"),
                        valuetype = "fixed", 
                        case_insensitive = TRUE)
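
    To look up the most common features rather than guess them, `topfeatures()` is one way to go. A small sketch on made-up text (`dfmat_toy` and `top` are illustrative names, not objects from the example above):

    ```r
    library(quanteda)

    # Three tiny made-up documents
    dfmat_toy <- dfm(tokens(c("sed in sed", "in vitae sed", "sed")))

    # Named vector of the n most frequent features, most frequent first
    top <- topfeatures(dfmat_toy, n = 2)
    names(top)
    ```

    The names of `top` can then be passed as the `pattern` argument when selecting features.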
    

    The dfm has many advantages, but using it with other tools usually means converting it to another object first. The conversion seems to lose the date information, but we can simply reattach it after the matrix is converted to a data.frame:

    df <- convert(dfm_sub, to = "data.frame")
    # Read the date through the docvars() accessor rather than the @docvars slot
    df$date <- docvars(dfm_sub, "date")
    
    head(df)
    #>   document in sed       date
    #> 1    text1  4   4 1999-01-31
    #> 2    text2  6   8 1999-01-04
    #> 3    text3  1   3 1999-01-30
    #> 4    text4  1   6 1999-02-01
    #> 5    text5  3   5 1999-01-17
    #> 6    text6  2   5 1999-01-28
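
    Because the dates were sampled at random, the rows are not in chronological order, which most time series tools expect. A minimal base R sketch of the sorting step, using the six rows printed above re-entered by hand (`df_head` and `df_sorted` are names I made up):

    ```r
    # The six rows shown by head(df), typed in manually
    df_head <- data.frame(
      document = paste0("text", 1:6),
      `in`     = c(4, 6, 1, 1, 3, 2),
      sed      = c(4, 8, 3, 6, 5, 5),
      date     = as.Date(c("1999-01-31", "1999-01-04", "1999-01-30",
                           "1999-02-01", "1999-01-17", "1999-01-28")),
      check.names = FALSE  # keep the column name "in", which is a reserved word
    )

    # Sort chronologically before any time series work
    df_sorted <- df_head[order(df_head$date), ]
    df_sorted
    ```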
    

    time series

    You weren't very specific about what kind of analysis you want to conduct. When talking about time series, I usually picture a line plot as the first step, so that is what I do here:

    library(tidyr)
    library(dplyr)
    library(ggplot2)
    df %>% 
      # "in" is a reserved word in R, so select both count columns by name
      pivot_longer(cols = c("in", "sed"), names_to = "word") %>% 
      ggplot(aes(x = date, y = value, color = word)) +
      geom_line()
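
    For anything beyond a plot you will probably need a regular series, and not every day has a document. A rough base R sketch on made-up counts (`counts`, `all_days`, `regular` and `sed_ts` are illustrative names); filling missing days with zero is an assumption that only makes sense if "no document" really means "no mentions":

    ```r
    # Made-up daily counts with a gap on 1999-01-03
    counts <- data.frame(
      date = as.Date(c("1999-01-01", "1999-01-02", "1999-01-04")),
      sed  = c(4, 8, 3)
    )

    # Every day between the first and the last observation
    all_days <- data.frame(date = seq(min(counts$date), max(counts$date), by = "day"))

    # Left join, so days without a document show up as NA
    regular <- merge(all_days, counts, by = "date", all.x = TRUE)

    # Assumption: a day without a document counts as zero
    regular$sed[is.na(regular$sed)] <- 0

    # Regular series with a weekly cycle, ready for functions that expect a ts object
    sed_ts <- ts(regular$sed, frequency = 7)
    sed_ts
    ```

    If zero is not a sensible value for missing days, leaving them as `NA` and interpolating later is the more cautious choice.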