
Splitting a word into length combination


I'm looking for a function in R that, given an integer, splits a word into all substrings of that length using a rolling window.

For example, function("stackoverflow", 4) would return:

c("stac", "tack", "acko", "ckov", "kove", "over", "verf", "erfl", "rflo", "flow")

Do you guys know if that function exists or must I create it?


Solution

  • ## install.packages("zoo")
    
    x <- unlist(strsplit("stackoverflow",""))
    zoo::rollapply(x,width=4,FUN = paste0,collapse="")
    # [1] "stac" "tack" "acko" "ckov" "kove" "over" "verf" "erfl" "rflo" "flow"
    

    Wrapped in a function:

    foo <- function(input, h) {
      x <- unlist(strsplit(input,""))
      zoo::rollapply(x,width=h,FUN = paste0,collapse="")
    }
    
    foo("stackoverflow", 4)
    # [1] "stac" "tack" "acko" "ckov" "kove" "over" "verf" "erfl" "rflo" "flow"
    

    A benchmark

    Consider the base R approach with substring():

    foo1 <- function(input, h) substring(input, seq_len(nchar(input)-h+1),h:nchar(input))
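
    The call above relies on substring() pairing up its start and stop vectors element by element. A small sketch of how those two index vectors line up for the example word (the variable names first, last and windows are only illustrative):

    ```r
    input <- "stackoverflow"
    h <- 4

    # Start position of each window: 1, 2, ..., nchar(input) - h + 1
    first <- seq_len(nchar(input) - h + 1)
    # Matching end positions: h, h + 1, ..., nchar(input)
    last <- h:nchar(input)

    # substring() pairs first[i] with last[i] and extracts one window per pair
    windows <- substring(input, first, last)
    print(windows)
    # [1] "stac" "tack" "acko" "ckov" "kove" "over" "verf" "erfl" "rflo" "flow"
    ```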
    

    Let's generate a very long toy character string:

    x <- paste0(rep("a",100000), collapse="")
    
    system.time(foo(x,4))
    #   user  system elapsed 
    #  2.280   0.004   2.288 
    
    system.time(foo1(x,4))
    #   user  system elapsed 
    # 10.492   0.000  10.509 
    

    So, in this benchmark, the seemingly vectorized function substring() is roughly 4.6 times slower than zoo::rollapply() on a 100,000-character string, which is an interesting observation!
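
    One possible base R alternative, offered only as a sketch and not timed here: split the string into characters once, then paste together h shifted copies of that character vector. The name foo2 is hypothetical, and the function assumes a single input string; the guard for an out-of-range h is an added assumption, not part of the functions above.

    ```r
    foo2 <- function(input, h) {
      # Split once into single characters, as in foo()
      chars <- strsplit(input, "")[[1]]
      n <- length(chars)

      # Assumption: return an empty result when no window of width h fits
      if (h < 1 || h > n) return(character(0))

      # Start index of every window
      starts <- seq_len(n - h + 1)

      # For k = 1..h, take the k-th character of every window,
      # then paste those h vectors together element by element
      cols <- lapply(seq_len(h), function(k) chars[starts + k - 1])
      do.call(paste0, cols)
    }

    foo2("stackoverflow", 4)
    # [1] "stac" "tack" "acko" "ckov" "kove" "over" "verf" "erfl" "rflo" "flow"
    ```

    Whether this is faster than foo() or foo1() would need to be measured with system.time() on the same input.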