Extracting collocates from texts/sentences


I have a large number of sentences, each of them containing at least one occurrence of 'well'. I'd like to get a list of the two words occurring immediately to the left of 'well' and the two words immediately to the right of 'well'. For example, in the sentence

"very well they all three get on well together"

the result should be for left: "NA" "very" "get" "on"

and for right: "they" "all" "together" "NA"

I suspect that sub() and regular expressions will be useful, but I don't know exactly how to assemble the pattern. How can this be done?


Solution

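  • A combination of quanteda and tidyr will get you there. I used package-qualified calls (pkg::fun) instead of attaching the packages with library(), so you can see which function comes from which package; only magrittr is attached, for the pipe.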

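    text <- "very well they all three get on well together"

    library(magrittr)  # attached only for the %>% pipe

    text %>%
      # recent quanteda versions require a tokens object as input to kwic()
      # (newer versions may also add a `pattern` column to the result)
      quanteda::tokens() %>%
      quanteda::kwic("well", window = 2) %>%
      data.frame() %>%
      # split each two-word context string into one column per word;
      # `fill` decides on which side the NA goes when a word is missing
      tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>%
      tidyr::separate(post, into = c("post1", "post2"), fill = "right")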
    
      docname from to pre1 pre2 keyword    post1 post2
    1   text1    2  2 <NA> very    well     they   all
    2   text1    8  8  get   on    well together  <NA>
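
If you would rather avoid extra packages, the same idea can be sketched in base R. This is only a minimal illustration, not part of the quanteda approach above: `collocates_of` is a hypothetical helper name, and it assumes the sentences are lower-case, start without leading whitespace, and can be tokenized by splitting on whitespace (punctuation attached to a word would stay attached).

```r
# Hypothetical helper (an assumption for illustration, not from any package):
# returns the `window` words to the left and right of every hit of `keyword`.
collocates_of <- function(sentence, keyword = "well", window = 2) {
  # naive tokenization: split on runs of whitespace
  tokens <- strsplit(sentence, "\\s+")[[1]]
  hits <- which(tokens == keyword)

  # safe lookup: positions outside the sentence yield NA
  pick <- function(i) {
    if (i >= 1 && i <= length(tokens)) tokens[i] else NA_character_
  }

  # for each hit, collect the positions before and after it
  left <- unlist(lapply(hits, function(h) {
    vapply(h + seq(-window, -1), pick, character(1))
  }))
  right <- unlist(lapply(hits, function(h) {
    vapply(h + seq_len(window), pick, character(1))
  }))

  list(left = left, right = right)
}

res <- collocates_of("very well they all three get on well together")
res$left   # NA "very" "get" "on"
res$right  # "they" "all" "together" NA
```

For many sentences, `lapply(sentences, collocates_of)` gives one left/right list per sentence. The quanteda route above is still the more robust choice once punctuation or mixed case is involved.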