I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.
I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language.
I can perform this task in most any language that supports (PCRE) by using a Positive Lookahead assertion while implementing a capturing group inside of the lookahead to capture the overlapped matches.
But, while actually performing this the same way I would in other languages, using perl=T
in R, no results yield.
> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""
The same goes for using both the stringi
and stringr
package.
> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""
The correct results that should be returned when executing this are:
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
I am well aware that regmatches
does not work well with captured matches, but what exactly causes this behavior in regmatches and why are no results returned? I am scavenging for a somewhat detailed answer.
Is the stringi
and stringr
package not capable of performing this over regmatches
?
Please feel free to add to my answer or come up with a different workaround than I have found.
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.