
How to pass str_locate_all output to str_sub and assign multiple replacements per string, vectorized


There are many string replacement questions, but I could not find one that addresses this issue specifically. I currently solve the problem with a long, slow for loop full of if/else branches. According to the str_sub documentation, though, the output of str_locate_all should pass cleanly to str_sub once it is in matrix form. I want to pass multiple matrices and assign multiple values simultaneously when the pattern occurs more than once in a string, so something like the following, vectorized:

str_sub(text1, matrix_output) <- unlist(replacements)

Here is the text I am using for example:

text1 <- c("The current year is 2016 and the month is 05", "A following month is 08 with year = 2017", "There are other years.", "The final year will be 2053")
replacements <- list(r=c('2022','08'), r =c('09','2023'), r = '3167')

To get the matrix output you can run:

matrix_output <- str_locate_all(text1, pattern = '\\d{2,4}')
matrix_output <- matrix(as.matrix(matrix_output), ncol=2)

The desired output is:

[1] "The current year is 2022 and the month is 08"
[2] "A following month is 09 with year = 2023" 
[3] "There are other years." 
[4] "The final year will be 3167" 

I am open to using other functions with str_locate_all's output, such as mgsub or gsubfn.

I tried using various combinations of str_replace_all, gsubfn, and mgsub with str_locate_all to solve the problem, but all involved loops.

I also looked at gsubfn, and in particular this post seems helpful. But this post applies to the situation where you already have the substrings to be replaced, so it skips the step of getting the actual substrings using str_locate_all.


Solution

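  • Here is an option that assigns into substrings using the location matrices from str_locate_all.

    library(stringr)
    library(stringi)
    # one two-column (start, end) matrix per string; zero rows when nothing matches
    matrix_output <- str_locate_all(text1, pattern = '\\d{2,4}')
    # flag the strings that contain at least one match
    i1 <- lengths(matrix_output) > 0
    # name each replacement vector after the index of the string it belongs to
    names(replacements) <- which(i1)
    # replace all located substrings, one string at a time
    for (i in which(i1)) {
      stri_sub_all(text1[i], matrix_output[[i]]) <- replacements[[as.character(i)]]
    }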
    

    -output

    > text1
    [1] "The current year is 2022 and the month is 08"
    [2] "A following month is 09 with year = 2023"
    [3] "There are other years."
    [4] "The final year will be 3167"
    

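    The loop can be avoided by collapsing the vector into a single string, replacing every match at once, and splitting the result again. This assumes the separator (";" here) does not occur in any of the strings.

    # join everything into one string so that a single location matrix covers all matches
    text2 <- str_c(text1, collapse = ";")
    matrix_output <- str_locate_all(text2, "\\d{2,4}")[[1]]
    # assign all replacements in one call, in order of appearance
    stri_sub_all(text2, matrix_output) <- unlist(replacements)
    # split back into the original elements
    text1 <- strsplit(text2, ";")[[1]]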
    

    -output

    text1
    [1] "The current year is 2022 and the month is 08"
    [2] "A following month is 09 with year = 2023"
    [3] "There are other years."
    [4] "The final year will be 3167"
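
    A minimal sketch of guarding the collapse-and-split approach against a separator that already occurs in the data. It is written in base R (paste, gregexpr, regmatches<-, strsplit) as stand-ins for the stringr/stringi calls, so it is not the code above verbatim. The separator choice and the names sep, joined, and out are assumptions for illustration.

    ```r
    text1 <- c("The current year is 2016 and the month is 05",
               "A following month is 08 with year = 2017",
               "There are other years.",
               "The final year will be 2053")
    replacements <- list(c("2022", "08"), c("09", "2023"), "3167")

    # assumption: the ASCII unit separator is unlikely to appear in ordinary text
    sep <- "\x1f"
    # fail early instead of silently splitting in the wrong place
    stopifnot(!any(grepl(sep, text1, fixed = TRUE)))

    joined <- paste(text1, collapse = sep)
    m <- gregexpr("\\d{2,4}", joined)
    # one input string, so the value is a list holding one vector of replacements
    regmatches(joined, m) <- list(unlist(replacements))
    out <- strsplit(joined, sep, fixed = TRUE)[[1]]
    out
    ```

    Note that strsplit drops a trailing empty string, so an empty last element of text1 would not survive the round trip.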
    

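    Another option is to extract the matched digits and use them as names for the replacement values, so that gsubfn can look each match up by name. This relies on each distinct matched value having a single replacement.

    library(gsubfn)
    # names are the matched substrings, values are their replacements
    nm1 <- setNames(unlist(replacements), unlist(str_extract_all(text1, "\\d{2,4}")))
    gsubfn("\\d+", as.list(nm1), text1)

    -output

    [1] "The current year is 2022 and the month is 08"
    [2] "A following month is 09 with year = 2023"
    [3] "There are other years."
    [4] "The final year will be 3167"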
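
    For comparison, a loop-free base R sketch of the same idea: gregexpr plays the role of str_locate_all, and regmatches<- accepts a list of replacement vectors, one element per input string. The names has_match, value, and out are illustrative, and the padding step assumes that replacements holds one entry per string that contains a match, in order.

    ```r
    text1 <- c("The current year is 2016 and the month is 05",
               "A following month is 08 with year = 2017",
               "There are other years.",
               "The final year will be 2053")
    replacements <- list(c("2022", "08"), c("09", "2023"), "3167")

    # match positions for every string; a string without a match yields -1
    m <- gregexpr("\\d{2,4}", text1)
    has_match <- vapply(m, function(pos) pos[1] != -1L, logical(1))

    # pad the replacement list so it lines up with text1;
    # the placeholder entries for strings without a match are never used
    value <- rep(list(""), length(text1))
    value[has_match] <- replacements

    # replace all matches in all strings with a single assignment
    out <- text1
    regmatches(out, m) <- value
    out
    ```

    Working on a copy (out) keeps text1 intact, which makes it easy to compare the result with the loop-based version.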