Search code examples
rldaquantedatopicmodels

Seeding words into an LDA topic model in R


I have a dataset of news articles that have been collected based on the criteria that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I want to therefore seed these words into the model, and I am not sure exactly how to do that.

I see that the package topicmodels allows for an argument called seedwords, which "can be specified as a matrix or an object class of simple_triplet_matrix", but there are no other instructions. It seems that a simple_triplet_matrix only takes integers, and not strings - does anyone know I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?

Here is a shortened version of the code:

library("quanteda")
library("lda")

##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))

##Create document feature matrix 
UKdfm2 <- dfm(UKcorp, ngrams =1, verbose = TRUE, toLower = TRUE,
         removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
         removeTwitter = FALSE, stem = TRUE, ignoredFeatures =     
         stopwords(kind="english"), keptFeatures = NULL, language = "english",     
         thesaurus = NULL, dictionary = NULL, valuetype = "fixed"))

##Convert to lda model 
UKlda2 <- convert(UKdfm2, to = "lda")

##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,  
          num.iterations = 1500, alpha = .1,eta = .01, initial = NULL, burnin 
          = NULL, compute.log.likelihood = TRUE, trace = 0L, freeze.topics = FALSE)

Solution

  • "Seeding" words in the topicmodels package is a different procedure, as it allows you when estimating through the collapsed Gibbs sampler to attach prior weights for words. (See for instance Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models (pp. 204–213). Association for Computational Linguistics.) But this is part of an estimation strategy for topics, not a way of ensuring that key words of interest remain in your fitted topics. Unless you have set a threshold for removing them based on sparsity, before calling lad::lda.collapsed.gibbs.sampler(), then every term in your UKlda2$vocab vector will be assigned probabilities across topics.

    Probably what is happening here is that your words are either of such low frequency that they are hard to locate near the top of any of your topics. It's also possible that stemming has changed them, e.g.:

    quanteda::char_wordstem("euroscepticism")
    ## [1] "eurosceptic"
    

    I suggest you make sure that your words exist in dfm first, through:

    colSums(UKdfm2)["eurosceptic"]
    

    And then you can look at the fitted distribution of topic proportions for this word and others in the fitted topic model object.