r regex stringr

Remove everything after and including the nth occurrence of a pattern in a string


I know that regex questions like this have been asked before, but I'm still struggling to get a working solution (even after consulting ChatGPT).

Take the following example: text <- c("test1", "test2 | ", "test3 | test3 | test 3", "test4 | test4 | test 4 | test4"). I want to remove all text starting at the n-th (in my case, the second) occurrence of " | ".

So the output should be: output <- c("test1", "test2 | ", "test3 | test3", "test4 | test4")

I got it working for strings with at most two " | " delimiters using str_remove(text, "( \\| [^\\|]+$)"), but this doesn't generalize to cases with more than two occurrences of the delimiter.


Solution

  • You can use

    library(stringr)
    n <- 2
    str_replace(text, paste0("^(.*?(?: \\| .*?){", n-1, "}) \\| .*"), "\\1")
    

    where

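    • \| matches a literal pipe (escaped because an unescaped | means alternation); together with the surrounding spaces it forms your " | " delimiter
    • .*? lazily matches any text, as little as possible (other than line break chars; add (?s) at the start of the pattern to make it match across lines)
    • (?: \| .*?){n-1} matches the first n-1 delimiters, each followed by the text after it, so capture group 1 holds everything before the n-th delimiter
    • str_replace is used rather than str_remove because the replacement \\1 puts the first group's value back while the rest of the match (the n-th delimiter and everything after it) is dropped. Strings with fewer than n delimiters do not match and are returned unchanged.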

    See the R demo online (and here is the resulting regex demo).
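
If you would rather avoid the stringr dependency, the same idea can be sketched in base R. This is only an illustration under a few assumptions: remove_from_nth is a hypothetical helper name (not part of any package), delim is expected to be an already-escaped regex, and perl = TRUE is used so that the lazy quantifiers and the non-capturing group are handled by PCRE.

```r
# Hypothetical helper (not from the original answer): drop everything from the
# n-th occurrence of a delimiter onwards, using base R only.
# `delim` is assumed to be a regex that is already escaped, e.g. " \\| ".
remove_from_nth <- function(x, n, delim = " \\| ") {
  stopifnot(n >= 1)
  # Group 1 captures the text containing the first n - 1 delimiters;
  # the n-th delimiter and everything after it are matched but not kept.
  pattern <- paste0("^(.*?(?:", delim, ".*?){", n - 1, "})", delim, ".*")
  # Elements with fewer than n delimiters do not match and stay unchanged.
  sub(pattern, "\\1", x, perl = TRUE)
}

text <- c("test1", "test2 | ", "test3 | test3 | test 3",
          "test4 | test4 | test 4 | test4")
result <- remove_from_nth(text, 2)
print(result)
```

With the example vector from the question and n = 2, this should return the requested output; passing n = 1 instead cuts each string at its first " | ".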