
Regex in R: match collocates of node word


I want to find collocates of a word in text strings. A word's collocates are those words that co-occur with it either preceding or following it. Here's a made-up example:

GO <- c("This little sentence went on and on.", 
        "It was going on for quite a while.", 
        "In fact it has been going on for ages.", 
        "It still goes on.", 
        "It would go on even if it didn't.")

Let's say I'm interested in the words collocating with the lemma GO, including all the forms the verb 'go' can take, namely 'go', 'went', 'gone', 'goes', and 'going'. I want to extract the collocates on both the left and the right of GO using str_extract from the package stringr and assemble them in a data frame. This works well as far as single-word collocates are concerned. I can do it like this:

library(stringr)

collocates <- data.frame(
  Left = str_extract(GO, "\\w+\\b\\s(?=(go(es|ing|ne)?|went))"),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)\\s\\w+\\b"))

That's the result:

collocates
       Left  Node Right
1 sentence   went    on
2      was  going    on
3     been  going    on
4    still   goes    on
5    would     go    on

But I'm interested not just in the one word before and after GO but, say, in up to three words before and after GO. Now using quantifier expressions gets me closer to the desired result but not quite there:

collocates <- data.frame(
  Left = str_extract(GO, "(\\w+\\b\\s){0,3}(?=(go(es|ing|ne)?|went))"),
  Node = str_extract(GO, "go(es|ing|ne)?|went"),
  Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))

And that's the result now:

collocates
                   Left  Node       Right
1 This little sentence   went   on and on
2               It was  going            
3          it has been  going            
4             It still   goes            
5             It would     go  on even if

While the collocates on the left side are all as desired, the collocates on the right side are partially missing. Why is that? And how can the code be changed to match all collocates correctly?

Expected output:

                   Left  Node         Right
1 This little sentence   went     on and on
2               It was  going  on for quite
3          it has been  going   on for ages
4             It still   goes            on
5             It would     go    on even if

Solution

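  • The quantifier {0,3} (match between 0 and 3 repetitions of the preceding group) lets the group match zero times, so the whole match can be empty. The lookbehind is already satisfied right after the "go" in "going" and "goes". No whitespace follows at that position, so the group matches zero times and str_extract returns an empty string from there instead of moving on to the end of the word.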

    r <- data.frame(
      Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){0,3}"))
    

    With a minimum of 1, the group must match at least one word. An empty match right after the "go" in "going" or "goes", which a minimum of 0 permits, is then ruled out, so the engine moves on to the end of the complete node word and captures up to three words from there.

    r <- data.frame(
      Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
    

    The effect of the minimum becomes visible when you vary the quantifier values:

    r <- data.frame(
      Right = str_extract(GO, "(?<=go(es|ing|ne)?|went)(\\s\\w+\\b){2,2}"))
    
    print(r)
    
         Right
    1   on and
    2   on for
    3   on for
    4     <NA>
    5  on even
    

    The example above uses {2,2} (minimum 2 and maximum 2). The 4th row has only one word after the node, so no match is possible there and the result is <NA>.
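
To make the role of the minimum quantifier concrete, here is a minimal sketch that first checks which rows come back empty with {0,3} and then tries a variant that keeps the minimum at 0. The variant is an assumption of this sketch, not part of the answer above: it puts a word boundary (\\b) directly after the lookbehind so that the position inside "going" or "goes" is rejected. The names node, right_0_3, right_wb, left_pat, and the test string "Off we go." are made up for illustration, and str_trim is used only to drop the surrounding whitespace.

```r
library(stringr)

GO <- c("This little sentence went on and on.",
        "It was going on for quite a while.",
        "In fact it has been going on for ages.",
        "It still goes on.",
        "It would go on even if it didn't.")

node <- "go(es|ing|ne)?|went"

# Right-hand pattern from the question: a minimum of 0 allows an empty match.
right_0_3 <- paste0("(?<=", node, ")(\\s\\w+\\b){0,3}")
empty_hits <- str_extract(GO, right_0_3) == ""

# Variant (assumption of this sketch): a word boundary after the lookbehind
# fails between "go" and "ing"/"es", so the match starts after the full word.
right_wb <- paste0("(?<=", node, ")\\b(\\s\\w+\\b){0,3}")
left_pat <- paste0("(\\w+\\b\\s){0,3}(?=(", node, "))")

collocates <- data.frame(
  Left  = str_trim(str_extract(GO, left_pat)),
  Node  = str_extract(GO, node),
  Right = str_trim(str_extract(GO, right_wb)))

print(collocates)

# A string that ends in the node word: compare the two approaches.
edge_wb  <- str_extract("Off we go.", right_wb)
edge_1_3 <- str_extract("Off we go.",
                        paste0("(?<=", node, ")(\\s\\w+\\b){1,3}"))
```

The difference between the two fixes only shows when nothing follows the node word: the word-boundary variant yields an empty string, whereas {1,3} yields NA. Which of the two is preferable depends on how missing collocates should be represented in the data frame.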