(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and to the right of GO. The three words can cross sentence boundaries but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces. The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right-hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included. And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left                  Node   Right
This little sentence  went   on and went
went on and           went   on It was
on It was             going  on for quite
quite a while         Going  on for ages
ages It’s still       going  on And will
on And will           go     on and on
and on and            go     on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When working with text, you usually want everything in lowercase, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, dots, and commas, you get pretty much what you want.
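library(quanteda)
library(dplyr)
library(splitstackshape)
# All forms of the node word GO (lowercase, because the text is lowercased below)
myvec <- c("go", "going", "goes", "gone", "went")
# Lowercase the text and strip full stops and commas
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
# Keyword-in-context with a window of 3 tokens on each side;
# recent quanteda versions expect a tokens object rather than a character vector
mydf <- kwic(x = tokens(mytext), pattern = myvec, window = 3) %>%
  as_tibble() %>%
  select(pre, keyword, post) %>%
  # Split the pre/post context strings into one column per word
  cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
  select(contains("pre"), keyword, contains("post"))
mydf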
   pre_1 pre_2  pre_3    keyword post_1 post_2  post_3
1: this  little sentence went    on     and     went
2: went  on     and      went    on     it      was
3: on    it     was      going   on     for     quite
4: quite a      while    going   on     for     ages
5: ages  it's   still    going   on     and     will
6: on    and    will     go      on     and     on
7: and   on     and      go      on     forever <NA>
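If you do need to keep the original capitalisation, a token-based approach in base R is one alternative. Working on word positions instead of regex matches also avoids the problem that one match consumes text the next match would need. This is only a minimal sketch under a few assumptions: all punctuation except apostrophes is dropped before splitting, and context() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Sample text from the question
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")

# Assumption: drop everything except letters, digits, apostrophes and whitespace,
# then split the cleaned string into word tokens
words <- strsplit(gsub("[^[:alnum:]'[:space:]]", "", GO), "\\s+")[[1]]

# Positions of every form of GO (whole token, case-insensitive)
node_idx <- grep("^(go(es|ing|ne)?|went)$", words, ignore.case = TRUE)

# Hypothetical helper: paste up to n tokens to the left or right of position i,
# clipping the window at the start and end of the text
context <- function(i, n, side) {
  idx <- if (side == "left") seq(i - n, i - 1) else seq(i + 1, i + n)
  idx <- idx[idx >= 1 & idx <= length(words)]
  paste(words[idx], collapse = " ")
}

# One row per node; all three columns have the same length by construction
collocates <- data.frame(
  Left  = vapply(node_idx, context, character(1), n = 3, side = "left"),
  Node  = words[node_idx],
  Right = vapply(node_idx, context, character(1), n = 3, side = "right"),
  stringsAsFactors = FALSE
)
collocates
```

Because every column is built from the same vector of node positions, the "differing number of rows" error cannot occur, and a node near the end of the text simply gets a shorter right-hand context.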