I have a tibble whose column contains character strings that mix English and Mandarin characters. I want to split that column into two: one column with the English, the other with the Mandarin. So far, the only way I have managed this is with the following code:
library(dplyr)    # tibble(), mutate(), select(), %>%
library(stringr)  # str_split()
tb <- tibble(x = c("I我", "love愛", "you你"))  # create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = TRUE)  # split at each run of characters that are not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = TRUE)   # split at each run of a-z characters
# each matrix contains a column of blank ("") values, so keep only the useful column and drop x
tb <- tb %>%
  mutate(EN = en[, 1],
         CH = ch[, 2]) %>%
  select(-x)
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R:
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use str_extract from stringr:
library(stringr)
# stringr uses ICU regex, where [[:alpha:]] also matches Han characters, so use explicit ASCII ranges
tb$en <- str_extract(tb$x, "[A-Za-z]+")
tb$ch <- str_extract(tb$x, "[^A-Za-z]+")
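If a single call that returns both columns is preferred, one possible sketch uses str_match with two capture groups. It assumes every string is a run of ASCII letters followed by a run of non-ASCII-letter characters, as in the example data. The names parts and res are only illustrative.

```r
library(tibble)
library(stringr)

tb <- tibble(x = c("I我", "love愛", "you你"))

# str_match() returns a character matrix: column 1 holds the full match,
# columns 2 and 3 hold the two capture groups.
# Assumption: each string is ASCII letters followed by non-ASCII-letter characters.
parts <- str_match(tb$x, "^([A-Za-z]+)([^A-Za-z]+)$")

res <- tibble(EN = parts[, 2], CH = parts[, 3])
res
```

Strings that do not follow the assumed letters-then-other layout would yield NA in both columns, so it may be worth checking for NA before relying on the result.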