
Regex to match every nth occurrence of a pattern


Consider this string,

str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"

I'd like to split the string at every nth occurrence of a pattern, here "-":

f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...

f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...

I know I could do it like this:

spl <- stringr::str_split(str, "-")[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm"   "n-o"    "p-qrst" "u-vw"   "x-yz" 

But I'm looking for a shorter and cleaner solution.

What are the possibilities?


Solution

  • What about the following (where 'n-1' is a placeholder for a number):

    (?:[^-]*(?:-[^-]*){n-1})\K-
    



    • (?: - Open the 1st non-capture group;
      • [^-]* - Match 0+ characters other than a hyphen;
      • (?: - Open a nested 2nd non-capture group;
        • -[^-]* - Match a hyphen and 0+ characters other than a hyphen;
        • ){n-1} - Close the nested non-capture group and match it n-1 times;
      • ) - Close the 1st non-capture group;
    • \K- - Forget what we just matched and match the trailing hyphen.

    Note: \K is a PCRE feature, so the pattern must be used with perl=TRUE.
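
As a quick check of the breakdown above, here is a minimal sketch for n = 2 (so the quantifier becomes {1}). It uses base R's gregexpr() to list the positions of the hyphens the pattern actually matches. The variable names `pattern` and `hits` are only illustrative and not part of the original answer.

```r
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"

# n = 2, so the nested group repeats n - 1 = 1 time
pattern <- "(?:[^-]*(?:-[^-]*){1})\\K-"

# perl = TRUE is required because of \K
hits <- as.integer(gregexpr(pattern, str, perl = TRUE)[[1]])
print(hits)
# [1]  7 14 19 23 30 35

# Each match is a single character: the hyphen itself
print(substring(str, hits, hits))
```

Because of \K, each reported match is only the hyphen that follows a complete group, which is why only every 2nd hyphen shows up in `hits`.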


    To fill in 'n-1' from a variable, we can build the pattern with sprintf():

    str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
    for (n in 1:10) {
      print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
    }
    

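
To get something shaped like the f(str, n) asked for in the question, the pattern construction can be wrapped in a small function. This is only a sketch: the name `split_every_nth` is hypothetical, and the separator is assumed to be a literal hyphen, as in the example string.

```r
# Hypothetical helper: split x at every nth hyphen (n >= 1)
split_every_nth <- function(x, n) {
  stopifnot(n >= 1)
  # Repeat the nested "-[^-]*" group n - 1 times, then match the nth hyphen
  pattern <- sprintf("(?:[^-]*(?:-[^-]*){%d})\\K-", as.integer(n) - 1L)
  # perl = TRUE is required because of \K
  strsplit(x, pattern, perl = TRUE)[[1]]
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"

print(split_every_nth(str, 2))
# [1] "abc-de" "fghi-j" "k-lm"   "n-o"    "p-qrst" "u-vw"   "x-yz"

print(split_every_nth(str, 3))
# [1] "abc-de-fghi" "j-k-lm"      "n-o-p"       "qrst-u-vw"   "x-yz"
```

A trailing piece with fewer than n hyphens (here "x-yz" for n = 3) is left as the last element, since the pattern no longer finds an nth hyphen to split on.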