Text dictionary-based sentiment analysis (tidytext)


I think I have done all the steps necessary to prepare my textual data for dictionary-based sentiment analysis: I have removed unnecessary characters and stop words and applied stemming. However, I am struggling to run the sentiment analysis itself, as shown below.

#Loading packages
library(tidyverse)
library(textdata)
library(tidytext)
require(writexl)
library(quanteda)

data example

dput(df[1:5,c(1,2,3)])

output:

structure(list(id = 1:5, username = c("106gunner", "CPTMiller", 
"matey1982", "Why so serious", "Joe Maya"), post = c("Was reported in SCMP news source underneath link", 
"Government already said ft or CECA create new good jobs for Singaporean", 
"gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious", 
"lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar", 
"From personal experience i lost my job to jhk")), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))
## Remove specific words and characters that add no value to the post.
strings_to_remove <- c("click","expand","Click","to", "can", "like", "also", "go", "just", "even", "now", "see", "got", "another", "dont", 
                       "know",">" ,"get","ones","team","didnt","first","mostly","old", "long", "time", "well", 
                       "going", "think", "still", "wanted", "instead", "times", "years", "high", "big", "thats", "using")

regex<-paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"),collapse="|")

df_test <- corpus_all %>% 
  mutate(post = str_remove_all(post, regex))

df_test$post <- gsub("Click to expand", "", df_test$post)

#Converting dataframe into a corpus object
df<- corpus(df_test,
                        docid_field = "id",
                        text_field = "post")

#Loading list of colloquial stop words
stopwords <- c(stopwords("en", source = "marimo"))


#Obtaining a DTM removing punctuation, numbers, and stopwords
toks <- tokens(df, 
               remove_punct = TRUE, 
               remove_numbers = TRUE) %>% 
  tokens_remove(pattern = stopwords)

dtm_c <- dfm(toks)

#stopwords can be removed from other sources such as "misc"
#looking at the list it seems like marimo has more words

#Looking at number of features:
dtm_c

#Stemming to reduce multiple conjugations/forms of a word to its root
tab  <- dfm_wordstem(dtm_c, language = "en")

tab<- na.omit(tab)

head(tab)

I then ran the code below, based on the solution here, but I am unable to resolve the error message I receive:

Error in UseMethod("inner_join") : no applicable method for 'inner_join' applied to an object of class "tokens"

#get the sentiment from the first text: 
toks %>%
  inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
  dplyr::count(sentiment) %>% # count the # of positive & negative words
  spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
  mutate(sentiment = positive - negative) # number of positive words - number of negative words

Solution

  • You're running into problems because you're combining two great packages, tidytext and quanteda, that represent text differently and don't always work well together. tidytext works with regular data frames or tibbles, while quanteda uses its own object classes (corpus, tokens, dfm). That is why inner_join() fails: dplyr has no inner_join() method for a quanteda tokens object.
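
    To make the mismatch concrete, here is a minimal sketch, assuming two toy documents (d1, d2) that are not part of the question's data. It checks that a tokens object is not a data frame, then uses the tidy() method that tidytext provides for a quanteda dfm to get a tibble that dplyr can join. The names toks, tidy_words, and bridged are illustrative only.

    ```r
    library(quanteda)
    library(tidytext)
    library(dplyr)

    # Toy documents (hypothetical, not from the question)
    toks <- tokens(c(d1 = "good jobs", d2 = "i lost my job"))

    # A tokens object is a quanteda-specific class, not a data frame,
    # so dplyr has no inner_join() method for it
    class(toks)
    is.data.frame(toks)

    # tidy() on a dfm returns a tibble with columns document, term, count;
    # rename term to word so it matches the bing lexicon's key column
    tidy_words <- tidy(dfm(toks)) |>
      rename(word = term)

    # With a regular tibble, the join works
    bridged <- inner_join(tidy_words, get_sentiments("bing"), by = "word")
    bridged
    ```

    This is one possible bridge rather than the only one. It keeps the quanteda preprocessing and only switches to a tibble for the dictionary join.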

    Here's a reproducible example that uses your data to find the sentiment-bearing words using tidytext and the bing sentiment dictionary. It has three steps:

    1. It unnests the column "post" into a new column called "word", with one row for each word.
    2. Then, it removes a set of stop words (junk words like "it", "a", and so on) using the stop_words data set bundled with tidytext.
    3. Then it joins our column of remaining words with the bing sentiment dictionary.

    library(tidytext)
    library(dplyr)
    
    df <- structure(
      list(
        id = 1:5, username = c(
          "106gunner", "CPTMiller", "matey1982", "Why so serious", "Joe Maya"), 
        post = c("Was reported in SCMP news source underneath link", 
                 "Government already said ft or CECA create new good jobs for Singaporean", 
                 "gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious", 
                 "lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar", 
                 "From personal experience i lost my job to jhk")), 
      row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
    
    
    # Step 1: unnest the column "post" into a new column called "word", with one row
    # for each word.
    # Step 2: anti_join() with a set of stopwords (junk words we don't care about)
    # Step 3: join our column of remaining words with the bing sentiment dictionary
    df |>
      tidytext::unnest_tokens(output = "word", input = "post") |>
      dplyr::anti_join(tidytext::stop_words, by = "word") |>
      dplyr::inner_join(tidytext::get_sentiments("bing"), by = "word")
    

    As a next step you could use dplyr::group_by() and dplyr::summarize() to count instances of positive or negative words per post, or look at other dictionaries that give numeric weightings (AFINN, for example) instead of just positive/negative ratings.
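
    A minimal sketch of that counting step, assuming a toy tibble called posts (hypothetical, standing in for your data) and the same three-step pipeline as above. The column names positive, negative, and net are my own choice.

    ```r
    library(tidytext)
    library(dplyr)

    # Toy posts (hypothetical stand-in for the question's data)
    posts <- tibble(
      id = 1:2,
      post = c("excellent news and happy workers", "terrible policy")
    )

    sentiment_by_post <- posts |>
      unnest_tokens(output = "word", input = "post") |>
      anti_join(stop_words, by = "word") |>
      inner_join(get_sentiments("bing"), by = "word") |>
      group_by(id) |>
      summarize(
        positive = sum(sentiment == "positive"),  # positive words per post
        negative = sum(sentiment == "negative"),  # negative words per post
        net = positive - negative                 # simple net sentiment score
      )

    sentiment_by_post
    ```

    Note that a post with no dictionary words drops out at the inner_join() step, so it will not appear in the summary at all. If you need a row for every post, join the result back to the original ids and fill the gaps with 0.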

    Please note also that your original example was not reproducible on my machine, because the variable corpus_all doesn't seem to be defined. You'll get better answers if you post reproducible examples.