Search code examples
rggplot2ldatopic-modelingtopicmodels

Graph a single LDA topic by date (in R)


I have a group of text files from several journals (let's call them journal A and journal B) that I am trying to run LDA on. I divide them each into their own corpus, then attach the names of the files to each corpus, store the journal of origin under the origin label, and finally, combine the two corpuses into myCorpus:

library(tm); library(topicmodels);

txtfolder <- "~/Path/to/txtfiles/"
source <- DirSource(txtfolder)
A.names <- list.files(path=txtfolder, pattern="A")
B.names <- list.files(path=txtfolder, pattern="B")
A.names <- lapply(X=A.names, FUN=function(i){gsub(".txt", '', x=i)})
B.names <- lapply(X=B.names, FUN=function(i){gsub(".txt", '', x=i)})
A.corpus <- Corpus(A.source, readerControl=list(reader=readPlain))
for (i in 1:length(A.corpus)){
  meta(A.corpus[[i]], tag = "origin") <- "A"
}
B.corpus <- Corpus(B.source, readerControl=list(reader=readPlain))
for (i in 1:length(B.corpus)){
  meta(B.corpus[[i]], tag = "origin") <- "B"
}
myCorpus <- c(A.corpus, B.corpus) # combining the two corpuses

From here I run LDA on myCorpus:

myCorpus <- tm_map(myCorpus, PlainTextDocument)
dtm <- DocumentTermMatrix(myCorpus, control = list(minWordLength=3))
n.topics <- 5
lda.model <- LDA(dtm, n.topics)
terms(lda.model,10)

From here I would like to create a plot measuring the proportion of each journal that is due to a particular topic over time (I can identify the time that each issue of the journals was published by parsing the txt files, and store them in a vector similarly to how I did with the origin tag). I'm not sure how best to store this information so that I can use date published as the horizontal axis. More importantly, how can I create the graph that I mentioned?


Solution

  • I assume you want to use ggplot since you added this tag. You will first need to gather your data in a data frame.

    Assuming lda is the output of LDA() and corpus is the corpus, you'll find the topics with topics(lda) and the various informations on the documents using meta(corpus). You might need to adjust this to your corpus :

    df <- data.frame(id=names(topics(lda)),
                     topic=topics(lda),
                     date=as.POSIXct(unlist(lapply(meta(corpus,type="local",tag="datetimestamp"),as.character))),
                     origin=unlist(meta(corpus,type="local",tag="origin"))    )
    

    Then, you need to compute the stats you want to plot : the frequency of each topic by date and by origin, and

    library(dplyr)
    library(tidyr)
    M <- df %>% gather(key,value,topic) %>%
      group_by(date,origin,value) %>%
      summarize(n=n()) %>%
      mutate(f=n/sum(n))
    

    Finally, in order to plot it :

    library(ggplot2)
    
    ggplot(data=M,aes(x=date,fill=factor(value),y=f)) + 
      geom_bar(stat="identity",position="stack") +
      facet_grid(~origin)
    

    Here is what it gives with simulated data

    set.seed(100)
    df <- data.frame(date=sample(seq.Date(as.Date("2015-07-27 10:12:25"),as.Date("2015-07-31 10:12:25"),by=1),100,replace=TRUE),
           id=1:100,
           topic=sample(1:5,100,replace=TRUE),
           origin=sample(c("A","B"),100,replace=TRUE))
    

    enter image description here