Search code examples
rnlptext-miningtopic-modeling

Finding the dominant topic in each sentence in topic modeling


One question that I can't find the answer for in R is that how I can find the dominant topic in NLP model for each sentence? Imagine I have data frame like this:

comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
             "solidly constructed lovingly maintained sf crest built",
             "one year since built new this well designed storey home",
             "beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
             "rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
             "fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
             "original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")

id <- c(1,2,3,4,5,6,7)

data <- data.frame(id, comment)

I do preprocess as shown below:

text_cleaning_tokens <- data %>% 
  tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)


text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>% 
  anti_join(stop_words)

stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))


tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
  tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)

The I ran FitLdaModel function on this data and finally, found the best topics based on 2 groups:

             t_1            t_2
1         beauti          built
2          block           home
3          renov          legal
4       bathroom          locat
5            bdm       bachelor
6      bdm_heart  bachelor_suit
7  beauti_street  block_norquai
8    beauti_upgr       blueridg
9        bedroom blueridg_legal
10  bedroom_unit   built_design

now based on the result I have, I want to find the most dominant topic in each sentence in topic modelling. For example, I want to know that for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?

Can anyone help me with this question? do we have any package that can do this?


Solution

  • It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.

    Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.

    The function LDA() gives you all you need to easily extract information. In particular, with get_topics() you get the most probable topic for each document. If you instead want to see the document-topic-weights you can do so with ldamodel@gamma. You will see that get_topics() does exactly what you asked.

    Please, see if this works for you.

    library(quanteda)
    #> Package version: 2.1.2
    #> Parallel computing: 2 of 16 threads used.
    #> See https://quanteda.io for tutorials and examples.
    #> 
    #> Attaching package: 'quanteda'
    #> The following object is masked from 'package:utils':
    #> 
    #>     View
    library(topicmodels)
    
    
    comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
                 "solidly constructed lovingly maintained sf crest built",
                 "one year since built new this well designed storey home",
                 "beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
                 "rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
                 "fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
                 "original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
    
    mycorp <- corpus(comment)
    docvars(mycorp, "id") <- 1L:7L
    
    mydfm <- dfm(mycorp)
    
    # convert the DFM to a Document Matrix for topicmodels
    forTM <- convert(mydfm, to = "topicmodels")
    
    myLDA <- LDA(forTM, k = 2)
    
    dominant_topics <- get_topics(myLDA)
    dominant_topics
    #> text1 text2 text3 text4 text5 text6 text7 
    #>     2     2     2     2     1     1     1
    
    dtw <- myLDA@gamma
    dtw
    #>           [,1]      [,2]
    #> [1,] 0.4870600 0.5129400
    #> [2,] 0.4994974 0.5005026
    #> [3,] 0.4980144 0.5019856
    #> [4,] 0.4938985 0.5061015
    #> [5,] 0.5037667 0.4962333
    #> [6,] 0.5000727 0.4999273
    #> [7,] 0.5176960 0.4823040
    

    Created on 2021-03-18 by the reprex package (v1.0.0)