Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row


I have a dataset where I am trying to get the sentiment by article. I have about 1000 articles. Each article is a string. This string has multiple sentences within it. I ideally would like to add another column that would summarise the sentiment for each article. Is there an efficient way to do this using dplyr?

Below is an example dataset with just 2 articles.

date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n  \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this  link  .',
   'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')

df<-data.frame(date, text, link, V4)

head(df)

I looked into how to do this with the sentimentr package and came up with the code below. However, it only outputs each sentence's sentiment (I split the sentences with a strsplit on the ., separator), and I want to aggregate everything to the full-article level after applying this strsplit.

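library(dplyr)      # %>%, group_by(), mutate()
library(tidyr)      # unnest()
library(sentimentr)
full<-df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>% # split each article on the "., " separator
  unnest(V2) %>%                                      # one row per sentence
  get_sentences() %>%
  sentiment()                                         # sentence-level scores only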

The output I am looking for is simply an extra column in my df data frame with a summary, sum(sentiment), for each article.

Additional info based on answer below:

df %>%
  group_by(V4) %>% # group by not really needed
  mutate(V4 = gsub("[.],", ".", V4), 
         sentiment_score = sentiment_by(V4)) 

# A tibble: 2 x 5
# Groups:   V4 [2]
  date       text                      link                                V4                                                  sentiment_score$e~ $word_count   $sd $ave_sentiment
  <date>     <chr>                     <chr>                               <chr>                                                            <int>       <int> <dbl>          <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~                  1         172 0.204       -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~                  1         161 0.329       -0.174  
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>. 

Solution

  • If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the ., in your text with plain periods, because the sentiment functions need ordinary sentence endings. They recognize that an abbreviation such as "mr." is not the end of a sentence. If you call get_sentences() and then sentiment(), you get the sentiment per sentence and not over the whole text.

    The function sentiment_by computes the sentiment over the whole text and averages the sentence scores. See its help page for the averaging.function argument if you need to change how the average is calculated. The by argument can handle any grouping you want to apply.

    library(dplyr)
    library(sentimentr)

    df %>%
      group_by(V4) %>% # group by not really needed
      mutate(V4 = gsub("[.],", ".", V4), 
             sentiment_score = sentiment_by(V4)) 
    
    # A tibble: 2 x 5
    # Groups:   V4 [2]
      date       text               link                      V4                            sentiment_score$~ $word_count   $sd $ave_sentiment
      <date>     <chr>              <chr>                     <chr>                                     <int>       <int> <dbl>          <dbl>
    1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~                 1         172 0.204       -0.00849
    2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~                 1         161 0.329       -0.174
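
If a plain numeric column is preferred over the nested sentiment_score data frame, one option is to keep only the ave_sentiment column of the sentiment_by() result. This also appears to sidestep the vctrs warning shown in the question, since nothing of class sentiment_by has to be combined across groups. The sketch below is a minimal example under stated assumptions: the toy data frame and its id column are hypothetical stand-ins for the question's df, and the sum_sentiment column is one possible reading of the requested sum(sentiment) per article.

```r
library(sentimentr)

# Hypothetical toy data standing in for the question's df;
# V4 uses the same "., " sentence separator as the real articles.
df <- data.frame(
  id = 1:2,
  V4 = c("The service was great., I love this place., It is on Main Street.",
         "The food was terrible., I hate waiting."),
  stringsAsFactors = FALSE
)

# Restore ordinary sentence endings, then split into sentences once.
df$V4 <- gsub("[.],", ".", df$V4)
sentences <- get_sentences(df$V4)

# sentiment_by() returns one row per input string, in input order;
# keep only the numeric average.
df$ave_sentiment <- sentiment_by(sentences)$ave_sentiment

# Same aggregation, but with a plain mean instead of the default averaging.
df$mean_sentiment <- sentiment_by(
  sentences,
  averaging.function = average_mean
)$ave_sentiment

# sum(sentiment) per article, built from the sentence-level scores.
per_sentence <- sentiment(sentences)
df$sum_sentiment <- as.numeric(
  tapply(per_sentence$sentiment, per_sentence$element_id, sum)
)

df[, c("id", "ave_sentiment", "mean_sentiment", "sum_sentiment")]
```

Inside a dplyr pipeline the same idea would be mutate(ave_sentiment = sentiment_by(get_sentences(V4))$ave_sentiment), without the group_by(V4) step.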