Tags: r, lapply, tibble, tidytext

Using a function to calculate a score, then storing it in the right variable of a dataframe or tibble


I am working on a function that performs a sentiment analysis for each emotion in the NRC dictionary, applied over a list of emotions (see: https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join), and then saves each score as a variable in a dataframe or tibble. I've got the analysis part working, but saving the score in the dataframe or tibble is not.

#Creating List of All Emotions To Apply This To
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticipation', 'fear', 'sadness', 'trust')
#Initialize List with Length of Emotion Vector
wcount <- vector("list", length(emotion))

#Create Tibble for me to Deposit the Result Into
nrc_tib <-tibble(id="", 
                anger=numeric(0), 
                disgust=numeric(0), 
                joy=numeric(0), 
                surprise=numeric(0), 
                anticipation=numeric(0), 
                fear=numeric(0), 
                sadness=numeric(0), 
                trust=numeric(0))
#Create Row to Deposit Variable Into
nrc_tib <-add_row(nrc_tib, 'id'="transcript1.txt")

#Defining Function
sentimentanalysis_nrc <- function(emoi) {

  #Getting Sentiment, Filtering by Emotion in List
  nrc_list <- get_sentiments("nrc") %>% 
    filter(sentiment == emoi)

  #Conducting Sentiment Analysis, Saving Results
  wcount[[emoi]] <- wordcount  %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)

    #Calculating Sentiment Score for Given Emotion
    score <- sum(wcount[[emoi]]$n)

    #Saving Emotion in nrc_tib, which is the part that doesn't work
    nrc_tib$emoi <- score
}

#Running the Function
lapply(emotion, FUN = sentimentanalysis_nrc)


I've tried a few different things, including putting emoi in brackets in the line that doesn't work, and some googling suggests that isn't allowed. What would be allowed if I wanted to save it?

Note, in case it helps for context: this example uses the file transcript1.txt, but my eventual goal is to generalize this across transcript2.txt through transcript45.txt and bind the scores for all 45 transcripts together afterwards.

EDIT: I came up with a clunky solution, using:

nrc_tib <<- replace(nrc_tib, emoi, score)

But there's got to be a better solution than that.


Solution

  • One of the big benefits of using tidy data principles is that problems like this become quite tractable! You can do this using joins.

    I'll use Jane Austen's novels as examples since you didn't post example data. Think of each book as one of your transcripts. The first step is to tidy the text data using unnest_tokens().

    library(tidyverse)
    library(tidytext)
    library(janeaustenr)
    
    tidy_books <- austen_books() %>%
      unnest_tokens(word, text)
    
    tidy_books
    #> # A tibble: 725,055 x 2
    #>    book                word       
    #>    <fct>               <chr>      
    #>  1 Sense & Sensibility sense      
    #>  2 Sense & Sensibility and        
    #>  3 Sense & Sensibility sensibility
    #>  4 Sense & Sensibility by         
    #>  5 Sense & Sensibility jane       
    #>  6 Sense & Sensibility austen     
    #>  7 Sense & Sensibility 1811       
    #>  8 Sense & Sensibility chapter    
    #>  9 Sense & Sensibility 1          
    #> 10 Sense & Sensibility the        
    #> # … with 725,045 more rows
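
    For transcripts stored as separate text files, the same tidying step might look roughly like the sketch below. It is not from the original post: the two tiny files are written to a temporary directory as hypothetical stand-ins for transcript1.txt, transcript2.txt, and so on, and the `id` column name is an assumption borrowed from the question.

    ```r
    library(dplyr)
    library(tibble)
    library(tidytext)

    # Hypothetical stand-ins for the real transcript files
    dir <- tempfile("transcripts")
    dir.create(dir)
    writeLines(c("I am happy today.", "What a good surprise!"),
               file.path(dir, "transcript1.txt"))
    writeLines("This is a sad and angry day.",
               file.path(dir, "transcript2.txt"))

    files <- list.files(dir, pattern = "^transcript[0-9]+\\.txt$", full.names = TRUE)

    # One row per line of text, tagged with the file it came from
    transcripts <- bind_rows(lapply(files, function(f) {
      tibble(id = basename(f), text = readLines(f))
    }))

    # One row per word, still tagged with the transcript id
    tidy_transcripts <- transcripts %>%
      unnest_tokens(word, text)
    ```

    Here `id` plays the role that `book` plays in the Austen example, so the later steps should carry over by swapping one column name for the other.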
    

    Then you can perform the sentiment analysis using an inner_join(). With this join, each word is matched to every emotion the lexicon associates with it, so a word appears in the resulting dataframe once per matching emotion (for example, "good" below is matched to anticipation, joy, and positive).

    tidy_books %>%
      inner_join(get_sentiments("nrc"))
    #> Joining, by = "word"
    #> # A tibble: 177,363 x 3
    #>    book                word        sentiment   
    #>    <fct>               <chr>       <chr>       
    #>  1 Sense & Sensibility sense       positive    
    #>  2 Sense & Sensibility sensibility positive    
    #>  3 Sense & Sensibility long        anticipation
    #>  4 Sense & Sensibility respectable positive    
    #>  5 Sense & Sensibility respectable trust       
    #>  6 Sense & Sensibility general     positive    
    #>  7 Sense & Sensibility general     trust       
    #>  8 Sense & Sensibility good        anticipation
    #>  9 Sense & Sensibility good        joy         
    #> 10 Sense & Sensibility good        positive    
    #> # … with 177,353 more rows
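
    To see the one-row-per-emotion behavior in isolation, here is a minimal sketch with a toy lexicon. The `toy_lexicon` table is a hypothetical five-row excerpt built from the matches visible in the output above, not the full NRC lexicon, and the `tokens` table is made up for illustration.

    ```r
    library(dplyr)
    library(tibble)

    # Hypothetical tokens from a single transcript
    tokens <- tibble(
      id   = "transcript1.txt",
      word = c("good", "respectable", "the")
    )

    # Toy excerpt in the same shape as get_sentiments("nrc")
    toy_lexicon <- tribble(
      ~word,         ~sentiment,
      "good",        "anticipation",
      "good",        "joy",
      "good",        "positive",
      "respectable", "positive",
      "respectable", "trust"
    )

    # Words missing from the lexicon ("the") are dropped;
    # words with several emotions are repeated, once per emotion
    joined <- tokens %>%
      inner_join(toy_lexicon, by = "word")
    ```

    In this sketch `joined` has five rows: three for "good", two for "respectable", and none for "the".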
    

    Now you can count() up the sentiment scores for each book (transcript in your case) and emotion/affect.

    tidy_books %>%
      inner_join(get_sentiments("nrc")) %>%
      count(book, sentiment)
    #> Joining, by = "word"
    #> # A tibble: 60 x 3
    #>    book                sentiment        n
    #>    <fct>               <chr>        <int>
    #>  1 Sense & Sensibility anger         1343
    #>  2 Sense & Sensibility anticipation  3698
    #>  3 Sense & Sensibility disgust       1172
    #>  4 Sense & Sensibility fear          1861
    #>  5 Sense & Sensibility joy           3364
    #>  6 Sense & Sensibility negative      4005
    #>  7 Sense & Sensibility positive      7429
    #>  8 Sense & Sensibility sadness       2064
    #>  9 Sense & Sensibility surprise      1589
    #> 10 Sense & Sensibility trust         4222
    #> # … with 50 more rows
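
    If you want one row per transcript with one column per emotion, as in the nrc_tib from the question, the long counts can be reshaped afterwards. The sketch below assumes tidyr's pivot_wider() is an acceptable way to do that; the `counts` table is hypothetical data shaped like the count() output above, with `id` standing in for `book`.

    ```r
    library(dplyr)
    library(tidyr)
    library(tibble)

    # Hypothetical long-format counts: one row per transcript and emotion
    counts <- tribble(
      ~id,               ~sentiment, ~n,
      "transcript1.txt", "anger",     3,
      "transcript1.txt", "joy",       7,
      "transcript2.txt", "anger",     5,
      "transcript2.txt", "trust",     2
    )

    # One row per transcript, one column per emotion;
    # emotions a transcript never matched are filled with 0
    nrc_wide <- counts %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
    ```

    Because every transcript is handled in the same pipeline, there is no need to build an empty tibble first or to bind 45 separate results together.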
    

    You can even pipe the counts straight into ggplot2 to make a plot!

    tidy_books %>%
      inner_join(get_sentiments("nrc")) %>%
      count(book, sentiment) %>%
      ggplot(aes(sentiment, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~book, scales = "free_y") +
      coord_flip()
    #> Joining, by = "word"
    

    Created on 2019-12-13 by the reprex package (v0.3.0)
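
    As a side note on the line from the question that did not work: `$` takes the column name literally, so `nrc_tib$emoi` refers to a column called "emoi" rather than the column named by the variable. In addition, an assignment inside a function only changes a local copy. The base R sketch below illustrates both points; `scores`, `score_for()`, and the numbers are hypothetical stand-ins, not the real analysis.

    ```r
    # Hypothetical one-row results table
    scores <- data.frame(id = "transcript1.txt", anger = NA_real_, joy = NA_real_)

    emoi <- "joy"

    # `$` uses the name as written: this adds a new column called "emoi"
    wrong <- scores
    wrong$emoi <- 10

    # `[[` evaluates the variable: this fills the existing "joy" column
    right <- scores
    right[[emoi]] <- 10

    # Returning each score and assembling the results afterwards avoids `<<-`.
    # score_for() is a hypothetical placeholder for the per-emotion scoring.
    score_for <- function(emotion) as.numeric(nchar(emotion))
    emotions <- c("anger", "joy")
    all_scores <- vapply(emotions, score_for, numeric(1))

    # Named vector -> one-row data frame with one column per emotion
    result <- as.data.frame(c(list(id = "transcript1.txt"), as.list(all_scores)))
    ```

    This should behave the same way for tibbles, though the join-and-count approach above is likely the simpler route once there are many transcripts.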