
How to replace an exact number of characters in a string based on occurrence between delimiters in R


I have text strings like this:

u <- "she goes ~Wha::?~ and he's like ~↑Yeah believe me!~ and she's etc."

What I'd like to do is replace all characters occurring between pairs of ~ delimiters (including the delimiters themselves) with, say, X.

This gsub call replaces each substring between a pair of ~ delimiters with a single X:

gsub("~[^~]+~", "X", u)
[1] "she goes X and he's like X and she's etc."

However, what I'd really like to do is replace every single character between the delimiters (and the delimiters themselves) with X. The desired output is this:

"she goes XXXXXXXX and he's like XXXXXXXXXXXXXXXXXXX and she's etc."

I've been experimenting with nchar, a backreference, and paste0 as follows, but the result is incorrect:

gsub("(~[^~]+~)", paste0("X{", nchar("\\1"),"}"), u)
[1] "she goes X{2} and he's like X{2} and she's etc."

Any help is appreciated.


Solution

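  • paste0("X{", nchar("\\1"),"}") yields X{2} because R evaluates the replacement argument before gsub runs: "\\1" is an ordinary two-character string (a backslash followed by 1), so nchar returns 2. \1 acts as a backreference only inside the replacement string that gsub itself processes; it cannot be handed to another function such as nchar. Note also that X{2} is inserted literally, since quantifiers have no meaning in a replacement string.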

    You can use the following solution based on stringr:

    > library(stringr)
    > u <- "she goes ~Wha::?~ and he's like ~↑Yeah believe me!~ and she's etc."
    > str_replace_all(u, '~[^~]+~', function(x) str_dup("X", nchar(x)))
    [1] "she goes XXXXXXXX and he's like XXXXXXXXXXXXXXXXXXX and she's etc."
    

    Each match of ~[^~]+~ is passed to the anonymous function, and str_dup("X", nchar(x)) returns a string of X's with the same number of characters as the match.
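For readers who prefer to avoid the stringr dependency, the same idea can be sketched in base R. The first lines reproduce why the original attempt yields X{2}; the rest locates the matches with gregexpr and overwrites each one through regmatches<-. The helper name mask_delimited and its pattern and fill parameters are hypothetical, introduced only for illustration, and strrep assumes R 3.3.0 or later.

```r
u <- "she goes ~Wha::?~ and he's like ~↑Yeah believe me!~ and she's etc."

# The replacement argument is built before gsub() ever sees it:
# "\\1" is just a backslash followed by 1, so nchar() returns 2.
nchar("\\1")
replacement <- paste0("X{", nchar("\\1"), "}")
replacement  # the literal string X{2}

# Hypothetical helper: replace every delimited span with an
# equally long run of `fill` characters, using base R only.
mask_delimited <- function(x, pattern = "~[^~]+~", fill = "X") {
  m <- gregexpr(pattern, x)  # positions of all matches
  regmatches(x, m) <- lapply(
    regmatches(x, m),        # the matched substrings
    function(hits) strrep(fill, nchar(hits))
  )
  x
}

masked <- mask_delimited(u)
masked
```

Because each match is replaced by a string of the same number of characters, nchar(masked) equals nchar(u), which is a quick way to check the result.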