Search code examples
rggplot2tmldatopic-modeling

Plot the evolution of an LDA topic across time


I'd like to plot how the proportion of a particular topic changes over time, but I've been having some trouble isolating a single topic and plotting over time, especially for plotting multiple groups of documents separately (let's create two groups to compare - journals A and B). I've saved dates associated with these journals in a function called dateConverter.

Here's what I have so far (with much thanks to @scoa):

library(tm); library(topicmodels);


txtfolder <- "~/path/to/documents/"
source <- DirSource(txtfolder)

myCorpus <- Corpus(source, readerControl=list(reader=readPlain))


for (i in 1:10){
  meta(myCorpus[[i]], tag = "origin") <- "A"
}
for (i in 11:length(myCorpus)){
  meta(myCorpus[[i]], tag = "origin") <- "B"
}
dates <- do.call("c", dateConverter)
for (i in 1:length(myCorpus)){
  meta(myCorpus[[i]], tag = "datetimestamp") <- dates[i]
}

dtm <- DocumentTermMatrix(myCorpus, control = list(minWordLength=3))


n.topics <- 10
lda.model <- LDA(dtm, n.topics)
terms(lda.model,10)
df <- data.frame(id=names(topics(lda.model)),
                 topic=posterior(lda.model),
                 date=as.POSIXct(unlist(lapply(meta(myCorpus,type="local",tag="datetimestamp"),as.character))),
                 origin=unlist(meta(myCorpus,type="local",tag="origin"))    )

How can I plot these?


Solution

  • This is just an adaptation of my previous answer :

    ## Load the data
    library(tm)
    
    ## Use built-in data set
    data(acq)
    myCorpus <- acq
    
    ## prepare the data
    for (i in 1:25){
      meta(myCorpus[[i]], tag = "origin") <- "A"
    }
    
    for (i in 26:length(myCorpus)){
      meta(myCorpus[[i]], tag = "origin") <- "B"
    }
    
    dates <- sample(seq.Date(as.Date("2013-01-01"),as.Date("2014-01-01"),length.out=8),50, replace=TRUE)
    
    for (i in 1:length(myCorpus)){
      meta(myCorpus[[i]], tag = "datetimestamp") <- dates[i]
    }
    
    dtm <- DocumentTermMatrix(myCorpus, control = list(minWordLength=3))
    
    
    library(topicmodels)    
    n.topics <- 5
    lda.model <- LDA(dtm, n.topics)
    terms(lda.model,10)
    

    Reshape data for plotting. I take the mean posterior for each group of topics, date and origin.

    df <- data.frame(id=names(topics(lda.model)),                 
                     date=as.POSIXct(unlist(lapply(meta(myCorpus,type="local",tag="datetimestamp"),as.character))),
                     origin=unlist(meta(myCorpus,type="local",tag="origin"))    )
    
    dft <- cbind(df,posterior(lda.model)$topics)
    
    library(dplyr)
    library(tidyr)
    M <- gather(dft,topic,value,-id,-date,-origin) %>%
      group_by(topic,date,origin) %>%
      summarize(value=mean(value))
    

    Plot

    library(ggplot2)
    ggplot(M,aes(x=date,color=origin,y=value)) + 
      geom_point() +
      geom_line() +
      facet_grid(topic~origin)
    

    enter image description here