Tags: r, regex, regex-lookarounds, stringr, strsplit

Split text with stringr::str_split at the location preceding a specific pattern (numeral or text)


Suppose I have a tibble with a column of strings:

df <- tibble::tribble(
  ~alternatives,
  " 23.32 | x232 code | This is a description| 43.11 | a341 code | some other description | optimised | v333 code | still another description"
)

I would like to split the string at the locations preceding numeric values, e.g. before 23.32 and before 43.11, and also before the word "optimised".

I expect each cell to contain a vector like this:

c("23.32 | x232 code | This is a description|", "43.11 | a341 code | some other description |", "optimised | v333 code | still another description")

What regex pattern would split the string before specific patterns? The number of pipe characters between those patterns may vary, so I cannot rely on them as delimiters. I am vaguely aware of lookahead and related constructs. The code below does not return what I expect, but I believe I am looking for something similar:

df2 <- 
  df %>% 
  mutate(alternatives = 
           str_split(alternatives, 
                     pattern = "(?<=[a-zA-Z])\\s*(?=[0-9])"))

What would be the solution?


Solution

  • You may try splitting on the following regex pattern:

    (?<=\S)\s+(?=(?:\d+\.\d+|optimised)\b)
    


    Updated script:

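    library(dplyr)    # provides %>% and mutate()
    library(stringr)  # provides str_split()
    # Split on whitespace that precedes a decimal number or the word "optimised"
    df2 <- df %>% 
        mutate(alternatives = 
            str_split(alternatives, 
                      pattern = "(?<=\\S)\\s+(?=(?:\\d+\\.\\d+|optimised)\\b)"))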
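
As a quick check, the pattern can be run on the sample string from the question, outside the data frame. This is only a minimal sketch, assuming the stringr package is installed; the names `x`, `pattern`, and `parts` are illustrative and not part of the original answer, and the final `str_trim()` call is an optional addition.

```r
library(stringr)

# The sample string from the question.
x <- " 23.32 | x232 code | This is a description| 43.11 | a341 code | some other description | optimised | v333 code | still another description"

# Split on whitespace that follows a non-space character and precedes
# either a decimal number or the literal word "optimised".
pattern <- "(?<=\\S)\\s+(?=(?:\\d+\\.\\d+|optimised)\\b)"

# str_split() returns a list with one character vector per input string,
# so take the first element for this single string.
parts <- str_split(x, pattern)[[1]]

# Optional (an addition to the answer): remove the leading space that
# remains on the first piece.
parts <- str_trim(parts)

print(parts)
```

The first piece keeps its leading space before trimming because the lookbehind `(?<=\S)` requires a non-space character before the whitespace, so nothing is split off at the very start of the string. Inside `mutate()`, the same call produces a list column with one character vector per row.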