Tags: r, tidyverse, text-analysis, tidytext, lexicon

inner_join removes more than a thousand words in my text analysis in R


I'm analysing a column of words in my most_used_words data frame, which contains 2,180 words:

most_used_words

        word times_used
       <chr>      <int>
 1    people         70
 2      news         69
 3      fake         68
 4   country         54
 5     media         44
 6       u.s         42
 7  election         40
 8      jobs         37
 9       bad         36
10 democrats         35
# ... with 2,170 more rows

When I inner_join with the AFINN lexicon, only 364 of the 2,180 words are scored. Is this because the rest of the words in my data frame don't appear in the AFINN lexicon? I'm afraid that, if so, this may introduce bias into my analysis. Should I use a different lexicon? Is something else happening?

library(tidytext)
library(tidyverse)    

afinn <- get_sentiments("afinn")

most_used_words %>%
  inner_join(afinn, by = "word")

          word times_used score
         <chr>      <int> <int>
     1    fake         68    -3
     2     bad         36    -3
     3     win         24     4
     4 failing         21    -2
     5    hard         20    -1
     6  united         19     1
     7 illegal         17    -3
     8    cuts         15    -1
     9   badly         13    -3
    10 strange         13    -1
    # ... with 354 more rows

Solution

  • "Is this because the words in the in the AFINN lexicon don't appear in my dataframe?" 
    

    Yes.

    An inner join only returns the rows (words) that match in both data frames. You can try a different lexicon, sure, but that might not help you with nouns. A noun identifies a person, animal, place, thing, or idea. In your example above, "u.s", "people", "country", "news", and "democrats" are all nouns that don't exist in AFINN, and none of them has any sentiment without context. Welcome to the world of text analysis.
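
    For instance, here is a minimal sketch (assuming the most_used_words data frame shown above) that uses dplyr's anti_join() to inspect exactly which words the inner join drops:

    library(tidytext)
    library(dplyr)

    afinn <- get_sentiments("afinn")

    # anti_join() keeps the rows of most_used_words with NO match in afinn,
    # i.e. the words that receive no sentiment score
    unmatched <- most_used_words %>%
      anti_join(afinn, by = "word")

    nrow(unmatched)  # 2180 - 364 = 1816 unscored words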

    However, based on the output of your analysis, I think you can conclude that the sentiment of your column of words is overwhelmingly negative: "fake" (68 uses) appears nearly twice as often as the next most used scored word, "bad" (36 uses).
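
    One rough way to quantify that impression is to weight each word's AFINN score by how often the word is used and sum the result; this is only a sketch built on the join above (note that recent tidytext versions name the AFINN column value rather than score):

    # frequency-weighted sentiment: a negative total suggests a negative tone
    most_used_words %>%
      inner_join(afinn, by = "word") %>%
      summarise(total_sentiment = sum(score * times_used))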

    If you had complete sentences, you could gain context by using the sentimentr R package. Check it out:

    install.packages("sentimentr")
    library(sentimentr)
    ?sentiment
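
    To give a flavour of what that looks like, here is a tiny sketch with made-up sentences; sentiment() scores each sentence and accounts for valence shifters such as "not" and "very":

    # hypothetical sentences, invented purely for illustration
    sentences <- c("The news is fake and the media is very bad.",
                   "We will win and create many jobs.")
    sentiment(sentences)
    # returns one row per sentence (element_id, sentence_id, word_count)
    # plus a signed sentiment score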
    

    It will take more work than what you've done here and will produce richer results, but in the end the conclusion will likely be the same. Good luck.