
Need clarification on how the sentiment function in sentimentr (trinker) calculates the average polarity score


I am using the sentiment analysis function sentiment_by() from the R package sentimentr (by trinker). I have a data frame containing the following columns: review comments, month, and year. I ran sentiment_by() on the data frame to find the average polarity score grouped by year and month, and I get the following values:

review_year review_month    word_count  sd  ave_sentiment
2015       March        8722    0.381686065 0.163440921
2015       April        7758    0.387046768 0.158812775
2015       May          7333    0.389256472 0.149220636
2015       November    14020    0.394711478 0.14691745
2016       February     7974    0.400406931 0.142345278
2015       September    8238    0.379989344 0.141740366
2015       February     7642    0.361415304 0.141624745
2015       December    24863    0.387409099 0.141606892
2016       March        8229    0.389033232 0.138552943
2016       January      10472   0.388300946 0.134302612
2015       August       7520    0.3640285   0.127980712
2016       May          3432    0.422246851 0.125041218
2015       June         8678    0.356612924 0.119333949
2015       January      9930    0.351126449 0.119225549
2016       April        9344    0.397066458 0.111879315
2015       July         8450    0.349963536 0.108881821
2015       October      7630    0.38017201  0.1044298

Next, I run sentiment_by() on the data frame grouped by comment alone, and then run the following on the resulting data frame to find the average polarity score by year and month:

sentiment_df[,list(avg=mean(ave_sentiment)),by="month,year"]

I get the following results:

month       year        avg
January     2015    0.110950199
February    2015    0.126943461
March       2015    0.146546669
April       2015    0.148264268
May         2015    0.143924126
June        2015    0.110691204
July        2015    0.106472437
August      2015    0.118976304
September   2015    0.135362187
October     2015    0.111441484
November    2015    0.137699548
December    2015    0.136786867
January     2016    0.128645808
February    2016    0.129139898
March       2016    0.134595706
April       2016    0.12106743
May         2016    0.142801514

As I understand it, both approaches should return the same results; correct me if I am wrong. The reason I want to use the second approach is that I need the average polarity grouped by both month and year, as well as by month alone, and I don't want to call sentiment_by() twice because of the extra run time. Could someone let me know what I am doing wrong here?


Solution

  • Here is an idea: the first call is probably averaging over the individual sentences in each year/month group, while the second is averaging the "ave_sentiment" values, each of which is already a per-comment average. An average of averages is generally not equal to the average of the individual elements, unless every group contains the same number of elements.
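
To make the point concrete, here is a minimal base-R sketch with made-up sentence scores (the data frame `scores` and its column names are hypothetical, not taken from the question's data, and no sentimentr calls are involved). It compares the pooled mean with the mean of per-comment means, and shows that weighting each per-comment mean by its sentence count recovers the pooled value for a plain mean.

```r
# Toy sentence-level scores: comment 1 has three sentences, comment 2 has one.
# (Hypothetical numbers, chosen only to illustrate the arithmetic.)
scores <- data.frame(
  comment_id = c(1, 1, 1, 2),
  sentiment  = c(0.5, 0.1, 0.0, -0.4)
)

# Approach 1: pool every sentence in the group and average once.
pooled <- mean(scores$sentiment)

# Approach 2: average per comment, then average those per-comment means.
per_comment   <- as.vector(tapply(scores$sentiment, scores$comment_id, mean))
mean_of_means <- mean(per_comment)

# Weighting each per-comment mean by its number of sentences
# gives back the pooled mean.
n_sentences <- as.vector(tapply(scores$sentiment, scores$comment_id, length))
weighted    <- weighted.mean(per_comment, w = n_sentences)

print(c(pooled = pooled, mean_of_means = mean_of_means, weighted = weighted))
```

Here the pooled mean is 0.05 while the mean of means is -0.1, because the single-sentence comment counts as much as the three-sentence one. If I remember correctly, the default averaging function in sentiment_by() is not a plain mean (it down-weights zero-scored sentences), so treat the weighted version as an approximation of the grouped result rather than an exact match, and check it against one direct call before relying on it.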