I recently asked a question about entries that are omitted after a sentiment analysis. The tweets that I analyse don't always contain words that are in the lexicon, and I would like to know which ones can't be scored. So I want to keep these entries even if zero words were scored. In my previous question, the drop parameter was suggested as a solution. However, I think I might be using it incorrectly or missing something. This is my first time working with these techniques.
The following function takes a data frame and returns a new one containing the number of positive and negative words along with the sentiment.
The input (with one text in Dutch on purpose, so it can't be scored):
id <- c(1, 2, 3)
date <- c("12-05-2021", "12-06-2021", "12-07-2021")
text <- c("Dit is tekst in het Nederlands", "I,m so happy that websites like this exsist", "This icecream tastes terrible. It made me upset")
df <- data.frame(id, date, text)
What I want as output is:
sentiment positive negative
        0        0        0
        2        2        0
       -2        0        2
But my function gives me something else:
sentimentAnalysis <- function(tweetData){
  sentimentDataframe <- data.frame()
  for(row in 1:nrow(tweetData)){
    tekst <- as.character(tweetData[row, "text"])
    positive <- 0
    negative <- 0
    tokens <- tibble(text = tekst) %>% unnest_tokens(word, text, drop = FALSE)
    sentiment <- tokens %>%
      inner_join(get_sentiments("bing")) %>%
      count(sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative)
    sentimentDataframe <- bind_rows(sentimentDataframe, sentiment)
  }
  sentimentDataframe[is.na(sentimentDataframe)] <- 0
  return(sentimentDataframe)
}
This still returns a data frame with the unscored texts missing. As you can see, the first text is omitted:
sentiment positive negative
        2        2        0
       -2        0        2
If no rows are returned after the join, you can return a tibble with all 0 values. We can use an if condition to check this.
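The reason the first text disappears can be sketched with a toy example. This is only an illustration: the two-word lexicon below is hypothetical and stands in for get_sentiments("bing"). When no token matches the lexicon, inner_join returns zero rows, count then also returns zero rows, and bind_rows has nothing to append.

```r
library(dplyr)

# Hypothetical two-word lexicon standing in for get_sentiments("bing")
lexicon <- tibble(word = c("happy", "terrible"),
                  sentiment = c("positive", "negative"))

# Tokens of a text that contains no lexicon words
tokens <- tibble(word = c("dit", "is", "tekst"))

# inner_join keeps only tokens that occur in the lexicon
matched <- inner_join(tokens, lexicon, by = "word")
nrow(matched)                    # 0

# Counting a zero-row table still gives zero rows
nrow(count(matched, sentiment))  # 0

# Binding a zero-row table onto an existing result adds no row
nrow(bind_rows(tibble(sentiment = 2), tibble(sentiment = numeric())))  # 1
```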
When a sentence contains only positive or only negative words, complete creates another row for the opposite sentiment and assigns it the value 0. I also replaced spread with pivot_wider, since spread is now superseded.
library(tidyverse)
library(tidytext)
map_df(df$text, ~{
  # Tokenise one text and keep only the words found in the lexicon
  tibble(text = .x) %>%
    unnest_tokens(word, text, drop = FALSE) %>%
    inner_join(get_sentiments("bing"), by = "word") -> tmp
  # No lexicon words at all: return a row of zeros instead of nothing
  if(nrow(tmp) == 0) tibble(sentiment = 0, positive = 0, negative = 0)
  else {
    tmp %>%
      count(sentiment) %>%
      # Make sure both sentiments are present, with a count of 0 if missing
      complete(sentiment = c('positive', 'negative'), fill = list(n = 0)) %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
      mutate(sentiment = positive - negative)
  }
}) -> res
res
# sentiment positive negative
# <dbl> <dbl> <dbl>
#1 0 0 0
#2 2 2 0
#3 -2 0 2
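To see what the complete step contributes on its own, here is a minimal sketch. The counts tibble is made up for illustration; it mimics what count(sentiment) would return for a text that has two positive words and no negative ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical result of count(sentiment) for a text with only positive words
counts <- tibble(sentiment = "positive", n = 2L)

wide <- counts %>%
  # Adds the missing "negative" row with n = 0
  complete(sentiment = c("positive", "negative"), fill = list(n = 0L)) %>%
  # One column per sentiment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0L) %>%
  # Both columns now exist, so the subtraction is always defined
  mutate(sentiment = positive - negative)

wide
```

Without complete, the negative column would not exist for this text, and positive - negative would either fail or silently pick up a variable of the same name from the enclosing environment.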