I have a large number of sentences, each of them containing at least one occurrence of 'well'. I'd like to get a list of the two words occurring immediately to the left of 'well' and the two words immediately to the right of 'well'. For example, in the sentence
"very well they all three get on well together"
the result should be for left: "NA" "very" "get" "on"
and for right: "they" "all" "together" "NA"
I suspect that sub() and regular expressions will be useful, but I don't know exactly how to assemble the query. How can this be done?
A combination of quanteda and tidyr will get you there. I left out the library() calls for these two packages and used the pkg::function form instead, so you can see which function comes from which package.
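library(magrittr)  # provides the %>% pipe

text <- "very well they all three get on well together"

text %>%
  quanteda::tokens() %>%  # recent quanteda versions require a tokens object for kwic()
  quanteda::kwic("well", window = 2) %>%
  data.frame() %>%
  # split the left context into two columns, padding with NA on the left
  tidyr::separate(pre, into = c("pre1", "pre2"), fill = "left") %>%
  # split the right context into two columns, padding with NA on the right
  tidyr::separate(post, into = c("post1", "post2"), fill = "right")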
  docname from to pre1 pre2 keyword    post1 post2
1   text1    2  2 <NA> very    well     they   all
2   text1    8  8  get   on    well together  <NA>
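
If you would rather not depend on extra packages, the same idea can be sketched in base R by splitting the sentence into words and indexing around each match. This is only a rough sketch under a few assumptions: `context_words` is a hypothetical helper name (it is not part of quanteda or tidyr), words are assumed to be separated by whitespace, and the keyword is matched exactly (case-sensitive, with no punctuation handling).

```r
# Hypothetical helper: collect the n words on each side of every occurrence
# of `keyword`, padding with NA where the sentence runs out.
context_words <- function(sentence, keyword = "well", n = 2) {
  words <- strsplit(sentence, "\\s+")[[1]]    # naive whitespace tokenisation
  hits  <- which(words == keyword)            # positions of the keyword
  pick  <- function(idx) {
    idx[idx < 1 | idx > length(words)] <- NA  # out-of-range positions become NA
    words[idx]
  }
  list(
    left  = unlist(lapply(hits, function(i) pick(seq(i - n, i - 1)))),
    right = unlist(lapply(hits, function(i) pick(seq(i + 1, i + n))))
  )
}

res <- context_words("very well they all three get on well together")
res$left   # NA "very" "get" "on"
res$right  # "they" "all" "together" NA
```

For many sentences you could apply the helper with lapply() over a character vector. For real text with punctuation or mixed case, a proper tokenizer such as the quanteda approach above is likely the safer choice.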