I have some text where people use capitals with spaces in between to make a substring stand out. I want to remove the spaces within these substrings. The rule for the pattern is: "at least 3 consecutive capital letters with a space between each letter".
I'm curious how to do this with pure regex, but also with the gsubfn package. I thought this would be an easy job for gsubfn, but in the MWE below I crashed and burned: the first attempt throws an error and the second inserts an extra letter (I'm curious why this is happening).
x <- c(
'Welcome to A I: the best W O R L D!',
'Hi I R is the B O M B for sure: we A G R E E indeed.'
)
## first to show I have the right regex pattern
gsub('(([A-Z]\\s+){2,}[A-Z])', '<FOO>', x)
## [1] "Welcome to A I: the best <FOO>!"
## [2] "Hi I R is the <FOO> for sure: we <FOO> indeed."
library(gsubfn)
spacrm1 <- function(string) {gsub('\\s+', '', string)}
gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm1, x)
## Error in (function (string) : unused argument ("L ")
## "Would love to understand why this error is happening"
spacrm2 <- function(...) {gsub('\\s+', '', paste(..., collapse = ''))}
gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm2, x)
## [1] "Welcome to A I: the best WORLDL!"
## [2] "Hi I R is the BOMBM for sure: we AGREEE indeed."
## "Would love to understand why the extra letter is happening"
Desired output:
[1] "Welcome to A I: the best WORLD!"
[2] "Hi I R is the BOMB for sure: we AGREE indeed."
As I pointed out in the comments, the problem in the first gsubfn call in the question arises because the regex has two capture groups but the function accepts only one argument. These need to match: two capture groups imply two arguments. We can see what gsubfn passes to the function by running the following and viewing the output of the print statement:
junk <- gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ print(list(...)), x)
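The same two arguments also explain the extra letter produced by spacrm2 in the question. Assuming gsubfn passes the first capture group (the whole spaced-out word) followed by the second capture group (the last repetition of the inner group, for example "L "), as the error message in the question suggests, the effect can be reproduced in base R without gsubfn. The names g1, g2, pasted and extra are illustrative only:

```r
# Arguments assumed to reach spacrm2 for the match "W O R L D":
# capture group 1 is the whole match, capture group 2 is its last repetition.
g1 <- "W O R L D"
g2 <- "L "

# paste() joins its arguments with a space before collapse is applied,
# so the second group is appended to the first.
pasted <- paste(g1, g2, collapse = "")
pasted
## [1] "W O R L D L "

# Stripping whitespace then leaves the stray trailing letter.
extra <- gsub("\\s+", "", pasted)
extra
## [1] "WORLDL"
```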
We can address this in any of the following ways:
1) This uses the regex from the question but uses a function that accepts multiple arguments. Only the first argument is actually used in the function.
gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ gsub("\\s+", "", ..1), x)
## [1] "Welcome to A I: the best WORLD!"
## [2] "Hi I R is the BOMB for sure: we AGREE indeed."
Note that it interprets the formula as the function:
function (...) gsub("\\s+", "", ..1)
We can view the function generated from the formula like this:
fn$identity( ~ gsub("\\s+", "", ..1) )
## function (...)
## gsub("\\s+", "", ..1)
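To check that ..1 picks out only the first argument, a function with the same body can be called directly in base R. The name first_only is hypothetical, and the two arguments are an assumption standing in for what gsubfn would pass for the match "W O R L D":

```r
# Same body as the function gsubfn builds from the formula.
first_only <- function(...) gsub("\\s+", "", ..1)

# The second argument (the inner capture group) is accepted but ignored.
res <- first_only("W O R L D", "L ")
res
## [1] "WORLD"
```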
2) This uses both the regex and the function from the question but adds the backref = -1 argument, which tells gsubfn to pass only the first capture group to the function. The minus sign means that the entire match is not passed either.
gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm1, x, backref = -1)
(As @Wiktor Stribiżew points out in his answer, backref = 0 would also work, since it passes only the entire match to the function.)
3) Another way to express this using the regex from the question is:
gsubfn('(([A-Z]\\s+){2,}[A-Z])', x + y ~ gsub("\\s+", "", x), x)
Note that it interprets the formula as this function:
function(x, y) gsub("\\s+", "", x)
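For comparison, here is a sketch of a base R approach that avoids gsubfn altogether. It is not one of the gsubfn options above: it locates the matches with gregexpr and rewrites them in place with the regmatches replacement function, so the number of capture groups does not matter. The variable names m and out are illustrative:

```r
x <- c(
  'Welcome to A I: the best W O R L D!',
  'Hi I R is the B O M B for sure: we A G R E E indeed.'
)

# Locate every run of at least 3 capital letters separated by whitespace.
m <- gregexpr('([A-Z]\\s+){2,}[A-Z]', x)

# Replace each match by the same text with its whitespace removed.
out <- x
regmatches(out, m) <- lapply(regmatches(out, m), function(s) gsub('\\s+', '', s))
out
## [1] "Welcome to A I: the best WORLD!"
## [2] "Hi I R is the BOMB for sure: we AGREE indeed."
```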