r dplyr sentiment-analysis tidytext

Errors in counting + combining bing sentiment score variables in Tidytext?


I'm doing sentiment analysis on a large corpus of text. I'm using the bing lexicon in tidytext to get simple binary positive/negative classifications, and I want to calculate the ratio of positive words to all scored (positive and negative) words within each document. I'm rusty with dplyr workflows, but the idea is to count the number of words coded as "positive" and divide it by the total count of words classified with a sentiment.

I tried this approach, using sample code and stand-in data . . .

library(tidyverse)
library(tidytext)

#Creating a fake tidytext corpus
df_tidytext <- data.frame(
  doc_id = c("Iraq_Report_2001", "Iraq_Report_2002"),
  text = c("xxxx", "xxxx") #Placeholder for text
)

#Creating a fake set of scored words with bing sentiments 
#for each doc in corpus
df_sentiment_bing <- data.frame(
  doc_id = c((rep("Iraq_Report_2001", each = 3)), 
             rep("Iraq_Report_2002", each = 3)),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                "negative", "negative", "negative") #Stand-ins for sentiment classification
)

#Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_sentiment_scored <- df_tidytext %>%
  left_join(df_sentiment_bing) %>%
  group_by(doc_id) %>%
  count(bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n) %>%
  summarise(bing_score = count(positive)/(count(negative) + count(positive)))

But I get the following error:

"Error: Problem with `summarise()` input `bing_score`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
ℹ Input `bing_score` is `count(positive)/(count(negative) + count(positive))`.
ℹ The error occurred in group 1: doc_id = "Iraq_Report_2001".

Would love some insight into what I'm doing wrong with my summarizing workflow here.


Solution

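  • After pivot_wider(), the positive and negative columns already hold the per-document word counts as integers, so there is nothing left to count. That is also the cause of the error: count() expects a data frame, not a numeric vector. Use sum() on those columns instead, after replacing the NA that pivot_wider() leaves for a document with no words of one sentiment with 0.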

    One solution could be:

    #Summarizing count of positive and negative words
    # (number of positive words out of total scored words in each doc)
    df_tidytext %>%
      left_join(df_sentiment_bing) %>%
      group_by(doc_id) %>%
      dplyr::count(bing_sentiment) %>%
      pivot_wider(names_from = bing_sentiment, values_from = n) %>%
      replace(is.na(.), 0) %>%
      summarise(bing_score = sum(positive)/(sum(negative) + sum(positive)))
    

    The result you should get is:

    Joining, by = "doc_id"
    # A tibble: 2 × 2
      doc_id           bing_score
      <fct>                 <dbl>
    1 Iraq_Report_2001      0.667
    2 Iraq_Report_2002      0
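
As a possible alternative, the ratio can also be computed straight from the long table of scored words, which avoids the count/pivot/replace steps entirely. This is a minimal sketch, assuming the same stand-in data as in the question; the intermediate column names n_positive and n_scored are made up for illustration and are not part of the original code.

```r
library(dplyr)

# Same stand-in data as in the question (one row per scored word)
df_sentiment_bing <- data.frame(
  doc_id = rep(c("Iraq_Report_2001", "Iraq_Report_2002"), each = 3),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                     "negative", "negative", "negative")
)

df_sentiment_scored <- df_sentiment_bing %>%
  group_by(doc_id) %>%
  summarise(
    # Hypothetical helper columns: words tagged positive, and all scored words
    n_positive = sum(bing_sentiment == "positive", na.rm = TRUE),
    n_scored   = sum(!is.na(bing_sentiment)),
    # Share of positive words among all scored words in the document
    bing_score = n_positive / n_scored
  )

print(df_sentiment_scored)
```

Because the comparison bing_sentiment == "positive" is summed per group, a document with no positive words gets 0 without any NA handling. A document with no scored words at all would yield NaN (0/0), which may or may not be what you want.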