Tags: r, text-mining, sentiment-analysis, tidytext

R - Finding top words in each NRC sentiment and emotion using syuzhet package


Snapshot of the dataset:

[Image: snapshot of the lyrics dataset]

I'm getting following chart:

[Image: faceted bar chart of the top words per NRC sentiment]

Here is the code:

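library(dplyr)     # %>%, count(), rename(), joins
library(tidytext)  # unnest_tokens(), get_sentiments()
library(ggplot2)   # ggplot(), geom_col(), facet_wrap()
library(syuzhet)   # loaded in the original; the code below only uses tidytext's get_sentiments()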

lyrics$lyric <- as.character(lyrics$lyric)

tidy_lyrics <- lyrics %>% 
  unnest_tokens(word,lyric)

song_wrd_count <- tidy_lyrics %>% count(track_title)

lyric_counts <- tidy_lyrics %>%
  left_join(song_wrd_count, by = "track_title") %>% 
  rename(total_words=n)

lyric_sentiment <- tidy_lyrics %>% 
  inner_join(get_sentiments("nrc"),by="word")

lyric_sentiment %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip()

I'm not sure whether the result I'm getting is correct. For instance, 'bad' shows up under multiple emotions. Also, if we inspect lyric_sentiment, the word 'shame' is present four times for 'Tim McGraw', although it appears only twice in that song.

What's the right approach?


Solution

  • You are doing it correctly. The NRC lexicon can assign a word to several sentiment categories, so after the inner join a word appears once for every category it belongs to. You can see this in the following example. You can also look up values on the NRC homepage.

    library(dplyr)
    library(tidytext)
    
    nrc <- get_sentiments("nrc")
    nrc %>% filter(word %in% c("bad", "shame"))
    # A tibble: 9 x 2
      word  sentiment
      <chr> <chr>    
    1 bad   anger    
    2 bad   disgust  
    3 bad   fear     
    4 bad   negative 
    5 bad   sadness  
    6 shame disgust  
    7 shame fear     
    8 shame negative 
    9 shame sadness
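
To see how this affects the counts in lyric_sentiment, here is a minimal sketch. It assumes a small stand-in lexicon (nrc_subset, copied from the nine rows shown above) and a made-up token table (tokens); neither is the real lyrics data, and the word "truck" is only there as an example of a word with no lexicon match. Recent dplyr versions may warn about a many-to-many join here, which is expected.

```r
library(dplyr)

# Stand-in for the NRC lexicon: only the rows printed above (assumption, not the full lexicon)
nrc_subset <- tibble(
  word = c(rep("bad", 5), rep("shame", 4)),
  sentiment = c("anger", "disgust", "fear", "negative", "sadness",
                "disgust", "fear", "negative", "sadness")
)

# Hypothetical tokenised lyrics: "shame" occurs twice, "bad" once
tokens <- tibble(
  track_title = "Tim McGraw",
  word = c("shame", "bad", "shame", "truck")
)

# One row per (token, sentiment) match, so each token is repeated
# once for every sentiment the lexicon assigns to it
joined <- inner_join(tokens, nrc_subset, by = "word")
nrow(joined)  # 13 rows: 4 + 5 + 4; "truck" has no match and is dropped

# Rows per word and sentiment, which is what the faceted chart plots
joined %>% count(word, sentiment)

# Actual occurrences per word, ignoring how many sentiments each word has
occurrences <- tokens %>%
  semi_join(nrc_subset, by = "word") %>%
  count(track_title, word)
occurrences
```

Counting within each sentiment, as the chart does, is not inflated by this repetition, because each occurrence contributes exactly one row per sentiment. The repetition only matters if you count rows of lyric_sentiment per word without grouping by sentiment.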