r regex tibble strsplit

str_split is returning only half of the string


I have a tibble whose column is a character vector of strings that mix English and Mandarin characters. I want to split that column into two, with one column returning the English and the other returning the Mandarin. However, I had to resort to the following code to accomplish this:

    library(tibble)
    library(dplyr)
    library(stringr)

    tb <- tibble(x = c("I我", "love愛", "you你")) # create tibble
    # split the string wherever R reads a run of characters that are not a-z
    en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = TRUE)
    # split the string wherever R reads a run of a-z characters
    ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = TRUE)
    # subset the matrices created above, because each one contains a column of
    # blank ("") values, and also remove the x column
    tb <- tb %>%
      mutate(EN = en[, 1],
             CH = ch[, 2]) %>%
      select(-x)
    tb

I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.


Solution

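  • We can use strsplit from base R, splitting at the zero-width position between a letter and a non-letter so that no characters are consumed: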

    do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
    

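    Or we can use str_extract from stringr with the same ASCII classes as above (in stringr, [[:alpha:]] also matches Mandarin characters, so it cannot separate the two scripts):

    library(stringr)
    tb$en <- str_extract(tb$x, "[A-Za-z]+")
    tb$ch <- str_extract(tb$x, "[^A-Za-z]+")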
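
The original attempt returns only half of each string because a split discards whatever the pattern matches: with "[^A-Za-z]+" the Mandarin itself is the delimiter, so it is thrown away and an empty string is left in its place. As a minimal sketch, the lookaround split can be wrapped so that it returns both columns in one call. It uses base R only; the helper name split_en_ch is hypothetical, the column names EN and CH are taken from the question, and it assumes every string is one English run followed by one non-English run.

```r
# Hypothetical helper (not part of the original answer): split each string at
# the boundary between an a-z letter and the first non-a-z character.
split_en_ch <- function(x) {
  # The lookbehind/lookahead pair matches a position, not characters,
  # so nothing is removed from the string by the split.
  parts <- strsplit(x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE)
  # Assumption: every element splits into exactly two pieces.
  m <- do.call(rbind, parts)
  data.frame(EN = m[, 1], CH = m[, 2], stringsAsFactors = FALSE)
}

# Same strings as in the question, written with Unicode escapes
x <- c("I\u6211", "love\u611b", "you\u4f60")
res <- split_en_ch(x)
print(res)
```

If some strings might contain no Mandarin at all, or several alternating runs, the pieces would have different lengths and the rbind step would need extra handling, so this sketch should only be used on input shaped like the example.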