
Split keep repeated delimiter


I'm trying to use the stringi package to split on a delimiter (which may be repeated) while keeping the delimiter. This is similar to a question I asked moons ago, "R split on delimiter (split) keep the delimiter (split)", except that here the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to write the regex so that it splits on the delimiter when there are repeats and also doesn't leave an empty string at the end of the string.

Base R, stringr, stringi, etc. solutions are all welcome.

The latter problem occurs because I use the greedy * on the \\s, but the space isn't guaranteed, so I could only think to leave it in:

MWE

text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

library(stringi)   
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")

# Outcome

## [[1]]
## [1] "I want to split here." "But also|"     "|"          "Why?"                 
## [5] ""                     
## 
## [[2]]
## [1] "See!"       "Split at end but no empty." ""                          
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

# Desired Outcome

## [[1]]
## [1] "I want to split here." "But also||"            "Why?"
## 
## [[2]]
## [1] "See!"                       "Split at end but no empty."
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

Solution

  • Using base R strsplit (perl=TRUE is needed for the lookbehind)

     strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
     #[[1]]
     #[1] "I want to split here." "But also||"            "Why?"                 
    
     #[[2]]
     #[1] "See!"                       "Split at end but no empty."
    
     #[[3]]
     #[1] "a third string."      "It has two sentences"
    

    Or, with stringi:

     library(stringi)
     stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
     #[[1]]
     #[1] "I want to split here." "But also||"            "Why?"                 
    
     #[[2]]
     #[1] "See!"                       "Split at end but no empty."
    
     #[[3]]
     #[1] "a third string."      "It has two sentences"
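
As a rough sketch of why this pattern works and how it might be extended: the lookbehind (?<=[.!|]) only asserts that a delimiter sits just before the split point, so the delimiter itself is never consumed. The alternation then consumes a run of spaces if there is one, and otherwise matches a zero-width word boundary. In a run like "||" there is no word boundary between the two pipes, so the split happens only after the last one. Note that the class above leaves out "?", which the question's own pattern included. The helper below (split_keep is a hypothetical name, not part of any package) adds it back, on the assumption that a question mark should also end a piece.

```r
# Hypothetical helper: split after ., !, ? or | while keeping the delimiter.
split_keep <- function(x) {
  # (?<=[?.!|])  zero-width: a delimiter must sit just before the split point
  # ( +|\\b)     consume spaces if present, otherwise split at a word boundary
  strsplit(x, "(?<=[?.!|])( +|\\b)", perl = TRUE)
}

text.var <- c("I want to split here.But also||Why?",
              "Why? Because.")

res <- split_keep(text.var)
print(res)
```

Because \\b needs a word character right after the delimiter, a piece that begins with punctuation (for example an opening quote) immediately after a delimiter, with no space in between, would not be split off. That case would need a different pattern.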