r regex dplyr stringr stringi

Recursive stringi commands


I am cleaning some string data using stringr functions (which are built on stringi) as part of a pipe.

I would like these functions to be recursive, so that they handle every occurrence of a regular expression, not only the first one. I cannot predict in advance how many times I would need to run the function to clean the data properly.

library(stringr)  # str_squish(), str_remove() and str_replace() come from stringr

test_1 <- "AAA A B BBB"
str_squish(str_remove(test_1, "\\b[A-Z]\\b"))
result <- "AAA B BBB"
desired <- "AAA BBB"

test_2 <- "AAA AA BBB BB CCCC"
str_replace(test_2,"(?<=\\s[A-Z]{2,3})\\s","")
result <- "AAA AABBB BB CCCC"
desired <- "AAA AABBB BBCCCC"

Solution

  • One option is base R's gsub, which replaces every match rather than only the first:

    test_1 <- "AAA A B BBB"
    gsub(" +", " ", gsub("\\b[A-Z]\\b", "", test_1))
    #[1] "AAA BBB"
    
    test_2 <- "AAA AA BBB BB CCCC"
    gsub("(?<=\\s[A-Z]{2})\\s", "", test_2, perl=TRUE)
    #[1] "AAA AABBB BBCCCC"
    

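    For the regex (?<=\\s[A-Z]{2,3})\\s it's not clear when the 2-3 letter condition should be checked and from where you start: against the original string, or against the string after each replacement. For example, stringr::str_replace_all checks every match against the original string and gives: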

    stringr::str_replace_all(test_2,"(?<=\\s[A-Z]{2,3})\\s","")
    #[1] "AAA AABBBBBCCCC"
    

    Alternatively, you can use a recursive function that replaces one match at a time until the string stops changing:

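    f <- function(x) {
      # Replace only the first match, then compare with the input
      y <- stringr::str_replace(x, "(?<=\\s[A-Z]{2,3})\\s", "")
      # No change means no match is left; otherwise recurse on the new string
      if (x == y) x else f(y)
    }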
    f(test_2)
    #[1] "AAA AABBB BBCCCC"
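
The recursive function above compares x == y, which assumes a single string. In a pipe over a whole column, a loop that stops once nothing changes may be easier to use. What follows is a minimal sketch in base R, not the answer's original code: apply_until_stable, drop_one_space and the max_iter cap are hypothetical names introduced here, and a capture group stands in for the look-behind so that sub() with perl = TRUE can be used.

```r
# Hypothetical helper: apply `step` repeatedly until the output stops changing.
# `max_iter` is an assumed safety cap, not part of the original answer.
apply_until_stable <- function(x, step, max_iter = 100L) {
  for (i in seq_len(max_iter)) {
    y <- step(x)
    # identical() compares the whole vector, so this also works for length > 1
    if (identical(x, y)) break
    x <- y
  }
  x
}

# One cleaning pass (hypothetical name): in each string, drop the first space
# that follows a whitespace character plus 2-3 capital letters.
drop_one_space <- function(x) sub("(\\s[A-Z]{2,3})\\s", "\\1", x, perl = TRUE)

test_2 <- "AAA AA BBB BB CCCC"
apply_until_stable(test_2, drop_one_space)
#[1] "AAA AABBB BBCCCC"
```

Because sub() is vectorised, the same call should also work on a character vector, for example inside dplyr::mutate(). Elements that are already clean simply stay unchanged while the others converge.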