
Regex in R: how to fill dataframe with multiple matches to left and right of target string


(This is a follow-up to Regex in R: match collocates of node word.)

I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.

Data:

GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")

Aim:

The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting three words to the left of GO and three words to the right of GO. The three words can cross sentence boundaries, but the extracted strings should not include punctuation.

What I've tried so far:

To extract left-hand collocates I've used str_extract_all from stringr:

unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and"         " It was"              "s still"             
[5] " And will"            " and"

This captures most but not all matches and includes spaces. The extraction of the node, by contrast, looks okay:

unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went"  "went"  "going" "Going" "going" "go"    "go"

To extract the right hand collocates:

unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went"  " on"           " on for quite" " on for ages"  " on"           " on and on"   
[7] " on forever"

Again the matches are incomplete and unwanted spaces are included. And finally assembling all the matches in a dataframe throws an error:

collocates <- data.frame(
  Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
  Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
  Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),  : 
      arguments imply differing number of rows: 6, 7

Expected output:

Left                    Node    Right
This little sentence    went    on and went
went on and             went    on It was
on It was               going   on for quite
quite a while           Going   on for ages
ages It’s still         going   on And will
on And will             go      on and on
and on and              go      on forever

Does anyone know how to fix this? Suggestions much appreciated.


Solution

  • With quanteda you can get the result below. Text analysis is usually easier on lowercase text, so I converted the input with tolower() and removed periods and commas with gsub(). I then applied kwic() with a window of three words and split the left and right contexts into one column per word with cSplit(). If you do not mind losing capital letters, periods, and commas, this gives you pretty much what you want.

    library(quanteda)
    library(dplyr)
    library(splitstackshape)
    
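    # Inflected forms of GO to search for
    myvec <- c("go", "going", "goes", "gone", "went")
    
    # Lowercase the text and strip periods and commas
    mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
    
    # Keyword-in-context with 3 words on each side; recent quanteda
    # versions expect a tokens object rather than a character vector
    mydf <- kwic(x = tokens(mytext), pattern = myvec, window = 3) %>% 
            as_tibble %>%
            select(pre, keyword, post) %>% 
            # One column per context word
            cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>% 
            select(contains("pre"), keyword, contains("post"))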
    
       pre_1  pre_2    pre_3 keyword post_1  post_2 post_3
    1:  this little sentence    went     on     and   went
    2:  went     on      and    went     on      it    was
    3:    on     it      was   going     on     for  quite
    4: quite      a    while   going     on     for   ages
    5:  ages   it's    still   going     on     and   will
    6:    on    and     will      go     on     and     on
    7:   and     on      and      go     on forever   <NA>
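
If losing the capital letters is a problem, the same windowing idea can be sketched in base R without any regex lookarounds. This is only a rough alternative, not part of the quanteda approach above. It assumes that splitting on whitespace and stripping punctuation from the edges of each word is an acceptable tokenization, and the helper name window_text is made up for illustration.

```r
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")

# Split on whitespace, then strip punctuation from the edges of each word
# (word-internal apostrophes, as in "It's", are kept)
words <- strsplit(GO, "\\s+")[[1]]
words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)

# Positions of the node: any listed form of GO, matched case-insensitively
node_idx <- grep("^(go(es|ing|ne)?|went)$", words, ignore.case = TRUE)

# Hypothetical helper: paste the words at the given positions,
# silently dropping positions that fall outside the text
window_text <- function(idx) {
  idx <- idx[idx >= 1 & idx <= length(words)]
  paste(words[idx], collapse = " ")
}

# One row per node, so all three columns always have the same length
collocates <- data.frame(
  Left  = vapply(node_idx, function(i) window_text((i - 3):(i - 1)), character(1)),
  Node  = words[node_idx],
  Right = vapply(node_idx, function(i) window_text((i + 1):(i + 3)), character(1)),
  stringsAsFactors = FALSE
)

print(collocates)
```

Because every row is built from a single node position, overlapping contexts are not a problem here, and a node near the start or end of the text simply gets a shorter context instead of being dropped.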