Tags: r, nlp, quanteda, tidytext

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)


This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speech text around certain keywords. My dataset is a large CSV file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.

I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!

Reprex (I hope?) below:

speech <- c(
  "This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of",
  "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.",
  "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech."
)

data <- data.frame(id = 1:3,
                   speechContent = speech)

Solution

  • I'd suggest using tokens_select() with the window argument, which keeps your target terms plus the specified number of tokens on either side of each match.

    To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:

    library("quanteda")
    ## Package version: 3.2.1
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 8 of 8 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    ## [CODE FROM ABOVE]
    
    corp <- corpus(data, text_field = "speechContent")
    
    toks <- tokens(corp) %>%
      tokens_select("stackoverflow", window = 10)
    toks
    ## Tokens consisting of 3 documents and 1 docvar.
    ## text1 :
    ##  [1] "One"           "relevant"      "word"          ","            
    ##  [5] "for"           "example"       ","             "is"           
    ##  [9] "the"           "word"          "stackoverflow" "."            
    ## [ ... and 9 more ]
    ## 
    ## text2 :
    ##  [1] "word"          "of"            "interest"      ","            
    ##  [5] "but"           "at"            "the"           "very"         
    ##  [9] "end"           "."             "stackoverflow" "."            
    ## 
    ## text3 :
    ## character(0)
    
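    If you also want to inspect the extracted windows themselves (closer to the keyword-in-context idea in your title), quanteda's kwic() returns each match together with its surrounding tokens. This is an optional addition on the same objects as above, not a required step for the sentiment analysis:

    kw <- kwic(tokens(corp), pattern = "stackoverflow", window = 10)
    kw
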

    There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g. the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015) that ships with quanteda:

    tokens_lookup(toks, data_dictionary_LSD2015) %>%
      dfm()
    ## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
    ##        features
    ## docs    negative positive neg_positive neg_negative
    ##   text1        0        1            0            0
    ##   text2        0        0            0            0
    ##   text3        0        0            0            0
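
    To reduce these counts to a single score per speech, one option is to convert the dfm to a data frame and take positive minus negative counts. A sketch, assuming that this simple difference is an acceptable scoring for your purposes (other weightings are possible):

    dfmat <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
      dfm()

    sent <- convert(dfmat, to = "data.frame")
    # net sentiment = positive minus negative dictionary matches per document
    sent$net_sentiment <- sent$positive - sent$negative
    sent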