r, sentiment-analysis, tidytext

R tidytext sentiment analysis: how to use the drop parameter


I recently asked a question about entries that are omitted after a sentiment analysis. The tweets that I analyse don't always contain words that are in the lexicon. I would like to know which ones can't be scored, so I want to keep them even if zero words were scored. In my previous question, the drop parameter was suggested as a solution. However, I think I am using it incorrectly or missing something. This is my first time working with these techniques.

The following function takes a data frame and returns a new one containing the number of positive and negative words along with the sentiment score.

The input (with one text in Dutch on purpose, so it can't be scored):

id <- c(1, 2, 3)
date <- c("12-05-2021", "12-06-2021", "12-07-2021")
text <- c("Dit is tekst in het Nederlands", "I,m so happy that websites like this exsist", "This icecream tastes terrible. It made me upset")

df <- data.frame(id, date, text)

What I want as output is:

sentiment     positive     negative
0             0            0
2             2            0
-2            0            2

But my function gives me something else:

sentimentAnalysis <- function(tweetData){
  
  sentimentDataframe <- data.frame()
  
  for(row in 1:nrow(tweetData)){
    
    tekst <- as.character(tweetData[row, "text"])
    
    positive <- 0
    negative <- 0
    
    tokens <- tibble(text = tekst) %>% unnest_tokens(word, text, drop = FALSE)
    
    sentiment <- tokens %>%
      inner_join(get_sentiments("bing")) %>% 
      count(sentiment) %>% 
      spread(sentiment, n, fill = 0) %>% 
      mutate(sentiment = positive - negative)
    
    
    sentimentDataframe <- bind_rows(sentimentDataframe, sentiment)
  }
  
  sentimentDataframe[is.na(sentimentDataframe)] <- 0
  return(sentimentDataframe)
  
}

This still returns a data frame with the unscored texts missing. As you can see, the first text is omitted:

sentiment     positive     negative
2             2            0
-2            0            2

Solution
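
Some background on why drop = FALSE alone does not bring the missing text back: in unnest_tokens, drop only controls whether the original input column is kept next to the tokens. It has no effect on which rows survive. The rows are lost one step later, in inner_join, which keeps only the tokens that match a word in the lexicon. A minimal sketch of this, assuming dplyr and tidytext are installed (the names dutch, tokens and matched are illustrative and not from the question):

```r
library(dplyr)
library(tidytext)

# the Dutch text from the question, which has no words in the lexicon
dutch <- tibble(text = "Dit is tekst in het Nederlands")

# drop = FALSE keeps the original `text` column alongside each token
tokens <- dutch %>%
  unnest_tokens(word, text, drop = FALSE)
names(tokens)   # includes both "text" and "word"

# the tokens disappear here: none of them matches a lexicon word,
# so the inner join returns zero rows for this text
matched <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word")
nrow(matched)
```

So the zero-row case has to be handled after the join, not in unnest_tokens.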

  • If the join returns no rows, you can return a tibble with all 0 values instead. An if condition checks for this case.

    When a sentence contains only positive or only negative words, complete adds a row for the opposite sentiment and assigns it the value 0. I also replaced spread with pivot_wider, since spread is now superseded.

    library(tidyverse)
    library(tidytext)
    
    map_df(df$text, ~{
      # tokenise one text and keep only the words found in the lexicon
      tibble(text = .x) %>% 
        unnest_tokens(word, text, drop = FALSE) %>%
        inner_join(get_sentiments("bing"), by = "word") -> tmp
      # no lexicon words at all: return a row of zeros instead of nothing
      if(nrow(tmp) == 0) tibble(sentiment = 0, positive = 0, negative = 0)
      else {
        tmp %>%
          count(sentiment) %>% 
          # make sure both sentiment labels exist before widening
          complete(sentiment = c('positive', 'negative'), fill = list(n = 0)) %>%
          pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
          mutate(sentiment = positive - negative)
      }
    }) -> res
    
    res
    #  sentiment positive negative
    #      <dbl>    <dbl>    <dbl>
    #1         0        0        0
    #2         2        2        0
    #3        -2        0        2
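
As a possible alternative to scoring one text at a time, the whole data frame can be tokenised in one pass and the missing ids restored afterwards with complete. This is only a sketch under a few assumptions: the id column is unique per text, the data frame is called tweets here (a hypothetical name, used to avoid clashing with the base function df), and unscored is an illustrative helper name.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# same example data as in the question
tweets <- data.frame(
  id   = c(1, 2, 3),
  date = c("12-05-2021", "12-06-2021", "12-07-2021"),
  text = c("Dit is tekst in het Nederlands",
           "I,m so happy that websites like this exsist",
           "This icecream tastes terrible. It made me upset")
)

scored <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment) %>%
  # restore ids that lost every row in the join, with both labels set to 0
  complete(id = tweets$id, sentiment = c("positive", "negative"),
           fill = list(n = 0L)) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0L) %>%
  mutate(sentiment = positive - negative)

# texts in which no word matched the lexicon
unscored <- tweets %>%
  semi_join(filter(scored, positive + negative == 0), by = "id")
```

Keeping id in the result makes it straightforward to join the scores back onto the original tweets, and unscored lists the texts that contained no lexicon words.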