I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, basically I will identify all the relevant news articles within a day, feed them into the polarity() function and obtain the average polarity scores of all the articles (more precisely, the average of all the sentence from all the articles) within that day.
The problem is, for some days, there will be many more articles compared to other days, and I think this might mask some of the info if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight compared to a score of 0.1 generated from only 3 articles. and sure enough, some of the more extreme polarity scores I obtained came from days whereby there are only few relevant articles.
Is there anyway I can take the different number of articles each day into consideration?
library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)
I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.
One approach would be to use number of words as in the following example (I like the first approach moreso here):
poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))
output <- scores(poldat2)
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weigth/max(weight)
weight2 <- output[["total.words"]]/max(output[["total.words"]])
output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
output[, -c(5:6)]
## sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1 f.cap.FALSE 158 1641 0.083 0.143583793 0.082504197
## 2 f.cap.TRUE 24 206 0.044 0.060969157 0.005564434
## 3 f.mont.TRUE 4 29 0.079 0.060996614 0.001397106
## 4 m.cap.FALSE 73 651 0.031 0.049163984 0.012191207
## 5 m.cap.TRUE 17 160 -0.176 -0.231357933 -0.017135804
## 6 m.escal.FALSE 9 170 -0.164 -0.218126656 -0.016977931
## 7 m.escal.TRUE 27 590 -0.067 -0.106080866 -0.024092720
## 8 m.mont.FALSE 70 868 -0.047 -0.078139272 -0.025099276
## 9 m.mont.TRUE 114 1175 -0.002 -0.003389105 -0.001433481
## 10 m.none.FALSE 7 71 0.066 0.072409049 0.002862997
## 11 none.none.FALSE 5 16 -0.300 -0.147087026 -0.002925046