Snapshot of the dataset:
I'm getting the following chart:
Here is the code:
library(dplyr)
library(ggplot2)
library(tidytext)
library(syuzhet)

lyrics$lyric <- as.character(lyrics$lyric)

# One row per word (token)
tidy_lyrics <- lyrics %>%
  unnest_tokens(word, lyric)

# Total number of words per song
song_wrd_count <- tidy_lyrics %>% count(track_title)

lyric_counts <- tidy_lyrics %>%
  left_join(song_wrd_count, by = "track_title") %>%
  rename(total_words = n)

# Join each word to its NRC sentiment(s)
lyric_sentiment <- tidy_lyrics %>%
  inner_join(get_sentiments("nrc"), by = "word")

# Top 10 words per sentiment, one facet per sentiment
lyric_sentiment %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip()
The issue is that I'm not sure whether the result I'm getting is correct. For instance, you can see that 'bad' is part of multiple emotions. Also, if we inspect lyric_sentiment, we'd see that the word 'shame' is present four times for 'Tim McGraw', whereas in reality it appears only twice in this song.
What's the right approach?
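You are doing it correctly. The NRC lexicon can place a word in several sentiment categories, so the inner join returns one row per word-sentiment pair. That is why 'bad' shows up in more than one facet and why a word has more rows in lyric_sentiment than it has occurrences in the song. You can see this in the following example. You can also look up individual words on the NRC homepage.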
library(dplyr)
library(tidytext)
nrc <- get_sentiments("nrc")
nrc %>% filter(word %in% c("bad", "shame"))
# A tibble: 9 x 2
  word  sentiment
  <chr> <chr>
1 bad   anger
2 bad   disgust
3 bad   fear
4 bad   negative
5 bad   sadness
6 shame disgust
7 shame fear
8 shame negative
9 shame sadness
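
If you need the actual number of times a word occurs in a song, count the tokens before joining to the lexicon. The sketch below shows the difference on toy data. The song titles and tokens are hypothetical, and `nrc_subset` is a hand-typed copy of the nine rows above, so nothing has to be downloaded. It is only meant to illustrate how the join multiplies rows, not to reproduce your dataset.

```r
library(dplyr)

# Hypothetical toy data: one row per token, as unnest_tokens() would produce
tidy_lyrics <- tibble(
  track_title = c("Song A", "Song A", "Song A", "Song B"),
  word        = c("shame", "shame", "bad", "bad")
)

# Hand-typed subset of the NRC lexicon (the nine rows shown above)
nrc_subset <- tibble(
  word      = c(rep("bad", 5), rep("shame", 4)),
  sentiment = c("anger", "disgust", "fear", "negative", "sadness",
                "disgust", "fear", "negative", "sadness")
)

# One row per (token, sentiment) pair.
# dplyr >= 1.1.0 may warn about a many-to-many relationship; that is expected here.
lyric_sentiment <- inner_join(tidy_lyrics, nrc_subset, by = "word")

# Actual occurrences per song: count tokens BEFORE the join
word_occurrences <- count(tidy_lyrics, track_title, word, name = "occurrences")

# Rows per song and word AFTER the join, for comparison
joined_rows <- count(lyric_sentiment, track_title, word, name = "joined_rows")

comparison <- left_join(word_occurrences, joined_rows,
                        by = c("track_title", "word"))
print(comparison)
```

In this toy data, joined_rows is always occurrences multiplied by the number of sentiments the word carries in the lexicon, so the counts in your chart are per word-sentiment pair rather than per word.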