I have survey data. Some questions allowed for multiple answers. In my data, the different answers are separated by a comma. I want to add a new row in the dataframe for each choice. So I have something like this:
survey$q1 <- c("I like this", "I like that", "I like this, but not much",
"I like that, but not much", "I like this,I like that",
"I like this, but not much,I like that")
If commas were only there to divide the multiple choices I'd use:
survey <- cSplit(survey, "q1", ",", direction = "long")
and get the desired result. Given some commas are part of the answer, I tried using comma followed by capital letter as a divider:
survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")
But for some reason it does not work. It does not give any error, but it does not split strings and also it removes some rows from the dataframe. I then tried using strsplit:
strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)
which works in splitting it correctly, but I'm not able to implement it so that each sentence becomes a different row of the same column, like cSplit does. The required output is:
survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"
Is there a way I can get it using one of the 2 methods? Thank you
One option is separate_rows from tidyr, whose sep argument is treated as a regular expression:
library(dplyr)
library(tidyr)
survey %>%
separate_rows(q1, sep=",(?=[A-Z])")
# q1
#1 I like this
#2 I like that
#3 I like this, but not much
#4 I like that, but not much
#5 I like this
#6 I like that
#7 I like this, but not much
#8 I like that
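With cSplit, there is an argument fixed, which is TRUE by default, so the separator is matched literally and the lookahead pattern never matches anything. If we use fixed = FALSE, it may fail. Judging by the call shown in the error message, this is probably because cSplit passes the pattern to strsplit without perl = TRUE, so it is not handled as a PCRE regex and the lookahead is rejected: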
library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)
Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
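The same behaviour can be reproduced with strsplit alone, which may help to see where the error comes from. This is only a small sketch on one hypothetical test string (x, pat and the result names are illustrative, not part of the original data):

```r
# One answer containing two choices separated by a comma
x <- "I like this,I like that"
pat <- ",(?=[A-Z])"

# Without perl = TRUE the default regex engine rejects the lookahead;
# tryCatch returns the condition object instead of stopping
no_perl <- tryCatch(strsplit(x, pat), error = function(e) e)

# With perl = TRUE the pattern is compiled as PCRE and the split succeeds
with_perl <- strsplit(x, pat, perl = TRUE)[[1]]

inherits(no_perl, "error")
with_perl
```

The first call fails in the same way as the cSplit call above, while the second splits only at the comma that is followed by a capital letter.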
One option to bypass it would be to first modify the column with a function (sub/gsub) that can take PCRE regex, changing the sep to a single character that does not occur in the answers, and then use cSplit on that sep:
# gsub (rather than sub) so that every separator in an answer is replaced, not only the first
cSplit(transform(survey, q1 = gsub(",(?=[A-Z])", ":", q1, perl = TRUE)),
"q1", sep=":", direction = "long")
# q1
#1: I like this
#2: I like that
#3: I like this, but not much
#4: I like that, but not much
#5: I like this
#6: I like that
#7: I like this, but not much
#8: I like that
survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much",
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))
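The strsplit attempt from the question can also be pushed into long format in base R, without cSplit. A minimal sketch, assuming survey is the one-column data frame shown above (parts and long are illustrative names; with more columns, the row indexing should carry them along as well):

```r
survey <- data.frame(
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split only at commas that are directly followed by a capital letter
parts <- strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each original row once per choice it contains
long <- survey[rep(seq_len(nrow(survey)), lengths(parts)), , drop = FALSE]

# Overwrite the column with the individual choices and reset the row names
long$q1 <- unlist(parts)
rownames(long) <- NULL
long
```

Here rep() with lengths(parts) duplicates the rows that held several choices, so each choice ends up in its own row of the same column.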