r text-mining sentiment-analysis tidytext

R - Finding top words in each NRC sentiment and emotion using syuzhet package

Snapshot of the dataset:

I'm getting the following chart:

Here is the code:

library(dplyr)     # %>%, count(), joins
library(ggplot2)   # plotting
library(tidytext)
library(syuzhet)

lyrics$lyric <- as.character(lyrics$lyric)

tidy_lyrics <- lyrics %>% 
  unnest_tokens(word,lyric)

song_wrd_count <- tidy_lyrics %>% count(track_title)

lyric_counts <- tidy_lyrics %>%
  left_join(song_wrd_count, by = "track_title") %>% 
  rename(total_words=n)

lyric_sentiment <- tidy_lyrics %>% 
  inner_join(get_sentiments("nrc"),by="word")

lyric_sentiment %>% 
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%   # top 10 words per sentiment, ranked by count n
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~sentiment, scales = "free") + 
  coord_flip()

The issue is that I'm not sure whether the result I'm getting is correct. For instance, you can see that 'bad' appears under multiple emotions. Also, if we inspect lyric_sentiment, the word 'shame' is present four times for 'Tim McGraw', although in reality it appears only twice in this song.

What's the right approach?

Solution

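You are doing it correctly. The NRC lexicon can assign a single word to several sentiment categories, and the inner join returns one row per matching category, so a word shows up in lyric_sentiment more often than it occurs in the lyrics. You can see this in the following example. You can also look up values on the NRC homepage.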

library(dplyr)
library(tidytext)

nrc <- get_sentiments("nrc")
nrc %>% filter(word %in% c("bad", "shame"))
# A tibble: 9 x 2
  word  sentiment
  <chr> <chr>    
1 bad   anger    
2 bad   disgust  
3 bad   fear     
4 bad   negative 
5 bad   sadness  
6 shame disgust  
7 shame fear     
8 shame negative 
9 shame sadness
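
If you want to compare the rows produced by the join with the number of times a word actually occurs, a minimal sketch along these lines may help. It is only an illustration: toy_nrc is a hand-typed copy of the nine rows printed above (so the example runs without downloading the lexicon), and toy_tokens and the track title "Toy Song" are made-up stand-ins for tidy_lyrics.

```r
library(dplyr)
library(tibble)

# Toy lexicon: the nine NRC rows shown above, typed in by hand (assumption:
# the real lexicon from get_sentiments("nrc") is replaced for illustration)
toy_nrc <- tribble(
  ~word,   ~sentiment,
  "bad",   "anger",
  "bad",   "disgust",
  "bad",   "fear",
  "bad",   "negative",
  "bad",   "sadness",
  "shame", "disgust",
  "shame", "fear",
  "shame", "negative",
  "shame", "sadness"
)

# Hypothetical tokenized song: "shame" occurs twice, "bad" once
toy_tokens <- tibble(
  track_title = "Toy Song",
  word = c("what", "a", "shame", "such", "a", "bad", "shame")
)

# One row per (occurrence, category) pair; recent dplyr versions may warn
# about a many-to-many relationship here, which is expected for this join
joined <- toy_tokens %>%
  inner_join(toy_nrc, by = "word")

# Rows per word after the join = occurrences x number of categories
joined_rows <- joined %>%
  count(word, name = "joined_rows")

# Actual occurrences per song, counted before the join duplicates rows
occurrences <- toy_tokens %>%
  semi_join(toy_nrc, by = "word") %>%
  count(track_title, word, name = "occurrences")

print(joined_rows)
print(occurrences)
```

In this toy data 'shame' occurs twice and has four categories, so it contributes eight rows to the join, while the count taken before the join still reports two occurrences. Counting per sentiment, as the plot in the question does, is unaffected, because each occurrence contributes exactly one row to each of its categories.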