
Split string into multiple rows by capital letters with cSplit


I have survey data. Some questions allowed for multiple answers. In my data, the different answers are separated by a comma. I want to add a new row in the dataframe for each choice. So I have something like this:

survey$q1 <- c("I like this", "I like that", "I like this, but not much",
 "I like that, but not much", "I like this,I like that", 
"I like this, but not much,I like that")

If commas were only there to divide the multiple choices I'd use:

survey <- cSplit(survey, "q1", ",", direction = "long")

and get the desired result. Given some commas are part of the answer, I tried using comma followed by capital letter as a divider:

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

But this does not work. It raises no error, but it does not split the strings, and it also drops some rows from the dataframe. I then tried using strsplit:

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

which works in splitting it correctly, but I'm not able to implement it so that each sentence becomes a different row of the same column, like cSplit does. The required output is:

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is there a way I can get it using one of the 2 methods? Thank you


Solution

  • An option with separate_rows

    library(dplyr)
    library(tidyr)
    survey %>% 
       separate_rows(q1, sep=",(?=[A-Z])")
    #                       q1
    #1               I like this
    #2               I like that
    #3 I like this, but not much
    #4 I like that, but not much
    #5               I like this
    #6               I like that
    #7 I like this, but not much
    #8               I like that
    

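    With cSplit, the fixed argument is TRUE by default, so the separator is matched literally rather than as a regex. Setting fixed = FALSE still fails here, most likely because cSplit passes the pattern to strsplit without perl = TRUE (see the call in the error message), and the default regex engine does not accept PCRE lookaheads such as (?=[A-Z]).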

    library(splitstackshape)
    cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)
    

    Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
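
The same failure can be reproduced with strsplit alone. The following is a minimal sketch using only base R; the variable names (x, msg, parts) are illustrative and not part of the original question.

```r
x <- "I like this, but not much,I like that"

# Default regex engine: the lookahead is rejected, so capture the error message.
msg <- tryCatch(
  strsplit(x, ",(?=[A-Z])"),
  error = function(e) conditionMessage(e)
)
# msg is a character string only if an error was caught
# (a successful strsplit call returns a list instead).
is.character(msg)

# PCRE engine: the lookahead is accepted, and only a comma
# directly followed by a capital letter acts as a separator.
parts <- strsplit(x, ",(?=[A-Z])", perl = TRUE)[[1]]
parts
```

This suggests the problem is the regex engine rather than the pattern itself, which is why the approaches that go through a PCRE-capable function work.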

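    One way around this is to first rewrite the separator with a function that accepts PCRE regex (gsub with perl = TRUE) and then have cSplit split on the new, literal separator. gsub is used rather than sub so that every matching comma is replaced, not just the first one in each string. The replacement character (":" here) must not occur in the answers themselves.

    cSplit(transform(survey, q1 = gsub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
           "q1", sep = ":", direction = "long")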
    #                        q1
    #1:               I like this
    #2:               I like that
    #3: I like this, but not much
    #4: I like that, but not much
    #5:               I like this
    #6:               I like that
    #7: I like this, but not much
    #8:               I like that
    

    data

    survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
    "I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
    )), class = "data.frame", row.names = c(NA, -6L))
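
Since the question also asks whether the strsplit result can be turned into rows, here is a hedged base R sketch of one way to do that. It assumes the survey data frame shown above; parts and long are hypothetical names introduced for illustration.

```r
survey <- data.frame(
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split each answer on a comma that is directly followed by a capital letter.
parts <- strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each original row once per piece, then overwrite q1 with the pieces.
long <- survey[rep(seq_len(nrow(survey)), lengths(parts)), , drop = FALSE]
long$q1 <- unlist(parts)
rownames(long) <- NULL

long
```

Because whole rows are repeated before q1 is overwritten, any other columns in survey would be carried along with each piece, which is similar in spirit to what direction = "long" does in cSplit.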