r text nlp tidyverse quanteda

In R, how to find the locations of all dictionary words, in a dataframe?


I'm analyzing corporate meetings, and I want to measure when people in the meetings bring up certain topics. By "when" I mean the position of the words within the text.

For example, in three meetings, when do people bring up "unionizing" and other words in my dictionary?

df <- data.frame(text = c(
  "we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.",
  "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"
))

dict <- c("unions", "strike", "unionizing")

Desired output:

text                                   location            word
we're meeting here today...            (location of word)  unionizing
hi all, unionizing an...               (location of word)  unionizing
hi all, unionizing an...               (location of word)  strike
hi all, unionizing an...               (location of word)  unionizing
we will discuss unionizing tomorrow... (location of word)  unionizing

I previously asked a question about finding the first time a word is used, and I tried to modify the code from that answer, but was unsuccessful.


Solution

  • Using quanteda:

    First tokenize the text and remove the punctuation; otherwise each punctuation mark is counted as a token and shifts the word positions. The advantage of using kwic is that you can easily see which words came before and after the word(s) you are looking for.

    library(quanteda)
    
    # Tokenize without punctuation, then find every dictionary word in context
    x <- kwic(tokens(df$text, remove_punct = TRUE), pattern = dict)
    data.frame(x)
    
      docname from to                             pre    keyword                        post    pattern
    1   text1   14 14   earnings we will also discuss unionizing                     efforts unionizing
    2   text2    3  3                          hi all unionizing  and the on-going strike is unionizing
    3   text2    7  7 all unionizing and the on-going     strike            is at the top of     strike
    4   text2   16 16       top of our agenda because unionizing threatens our revenue goals unionizing
    5   text3    4  4                 we will discuss unionizing tomorrow today the focus is unionizing
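
To get from the kwic result to the table layout asked for in the question, the hits can be joined back to the original text through the document name. The following is a minimal sketch, not part of the original answer. It assumes the default document names (text1, text2, ...) that quanteda assigns to an unnamed character vector. The object names toks, hits, idx and res, and the relative column (position divided by the number of tokens in that meeting), are illustrative additions.

```r
library(quanteda)

df <- data.frame(text = c(
  "we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.",
  "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"
))
dict <- c("unions", "strike", "unionizing")

# Tokenize once so the same tokens object can be reused for lengths and names
toks <- tokens(df$text, remove_punct = TRUE)
hits <- as.data.frame(kwic(toks, pattern = dict))

# Map each hit back to its source row via the document name
idx <- match(hits$docname, docnames(toks))

res <- data.frame(
  text     = df$text[idx],
  location = hits$from,      # token position of the match within its text
  # Illustrative extra column: position as a share of the text's length
  relative = round(hits$from / unname(ntoken(toks))[idx], 2),
  word     = hits$keyword
)
res
```

The location column is the raw token index, so it is only comparable across meetings of similar length; the relative column is one way to put meetings of different lengths on the same 0 to 1 scale.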