Tags: r, sentiment-analysis, qdap

Sentiment analysis with different numbers of documents


I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, I identify all the relevant news articles within a day, feed them into the polarity() function, and obtain the average polarity score of all the articles (more precisely, the average over all the sentences from all the articles) within that day.

The problem is that some days have many more articles than others, and I think this might mask some of the information if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. Sure enough, some of the more extreme polarity scores I obtained came from days where there were only a few relevant articles.

Is there any way I can take the different number of articles each day into consideration?

library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)

Solution

  • I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.

    One approach would be to weight by the number of words, as in the following example (I like the first weighting, weight, more so than the second, weight2, here):

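    library(qdap)

    # Polarity by group (sex, family affiliation, died) in qdap's mraja1spl data
    poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))

    # Group-level scores, including total.words and ave.polarity
    output <- scores(poldat2)

    # Weight 1: damped log function of word count, rescaled so the largest weight is 1
    weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
    weight <- weight/max(weight)
    # Weight 2: word count as a proportion of the largest group's word count
    weight2 <- output[["total.words"]]/max(output[["total.words"]])

    output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
    output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
    output[, -c(5:6)]  # drop columns 5 and 6 for a compact display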
    
    
    ##    sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
    ## 1       f.cap.FALSE             158        1641        0.083       0.143583793        0.082504197
    ## 2        f.cap.TRUE              24         206        0.044       0.060969157        0.005564434
    ## 3       f.mont.TRUE               4          29        0.079       0.060996614        0.001397106
    ## 4       m.cap.FALSE              73         651        0.031       0.049163984        0.012191207
    ## 5        m.cap.TRUE              17         160       -0.176      -0.231357933       -0.017135804
    ## 6     m.escal.FALSE               9         170       -0.164      -0.218126656       -0.016977931
    ## 7      m.escal.TRUE              27         590       -0.067      -0.106080866       -0.024092720
    ## 8      m.mont.FALSE              70         868       -0.047      -0.078139272       -0.025099276
    ## 9       m.mont.TRUE             114        1175       -0.002      -0.003389105       -0.001433481
    ## 10     m.none.FALSE               7          71        0.066       0.072409049        0.002862997
    ## 11  none.none.FALSE               5          16       -0.300      -0.147087026       -0.002925046
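
To connect this back to the daily newspaper setting in the question, here is a minimal base R sketch that applies the same two weights per day instead of per group. It is only an illustration under stated assumptions: the articles data frame, its column names (date, polarity, words), and all of its values are made up, and the helper names log_weight and prop_weight are hypothetical. In practice the polarity and word-count columns would come from polarity() output. Note that the log-based weight is 2 * (1 - 1/(1 + ln(n)/2)) - 1 before rescaling, which turns negative when n falls below roughly exp(2), about 7.4, so it is probably better applied to word counts than to raw article counts.

```r
# Hypothetical per-article data (invented values, for illustration only)
articles <- data.frame(
  date     = as.Date(c("2014-01-01", "2014-01-01", "2014-01-01", "2014-01-02")),
  polarity = c(0.10, 0.20, 0.00, -0.60),
  words    = c(400, 500, 300, 50)
)

# One row per day: article count, total words, unweighted mean polarity
daily <- data.frame(date = sort(unique(articles$date)))
daily$n_articles   <- as.vector(tapply(articles$polarity, articles$date, length))
daily$total_words  <- as.vector(tapply(articles$words, articles$date, sum))
daily$ave_polarity <- as.vector(tapply(articles$polarity, articles$date, mean))

# Same log-based weight as in the answer, rescaled so the largest weight is 1
# (hypothetical helper name; assumes every n is comfortably above exp(2))
log_weight <- function(n) {
  w <- ((1 - (1 / (1 + log(n, base = exp(2))))) * 2) - 1
  w / max(w)
}

# Same proportional weight as in the answer (hypothetical helper name)
prop_weight <- function(n) n / max(n)

daily$weighted_polarity  <- daily$ave_polarity * log_weight(daily$total_words)
daily$weighted_polarity2 <- daily$ave_polarity * prop_weight(daily$total_words)

print(daily)
```

In this toy data the second day has a single short, strongly negative article, so both weights pull its score toward zero, and the proportional weight pulls it much harder than the log-based one. Whether that shrinkage is desirable depends on the research question, as cautioned above.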