Search code examples
rdata-cleaningqualtrics

Converting a column in a wide dataset which is a string of concatenated question names in order they were asked into trial number in a long dataset


I'm dealing with a very annoying data cleaning problem I can't seem to figure out a good way to solve. It comes from the way that Qualtrics records question ordering for surveys when those questions can be in random order.

Essentially, I have a dataset of survey responses which is currently at the level of one participant/row. The columns include id (participant id), 3 columns for questions 1 to 3, and then a column which is the order in which the questions were presented, which is a string for each participant of the order of the questions with pipes seperating each question. Complicating this is that the question names are not super consistent, and have a lot of random punctuation etc. in them -- I tried to replicate the most common oddities in the example below (random underscores in the middle of the questions, non consistent ordering of the questions, etc). The rawData dataframe below shows an example of roughly what the data looks like -- in reality I have a few hundred participants and a few dozen questions.

I would like to get the dataset to the level of participant-question, with one numeric column for participant, one character column for which question they were answering, and one numeric column for which trial the question was (i.e. did they see it first, second?). The dataframe below, finalDataSnippet, shows what I would like it to look like in the end.

rawData = data.frame(id = 1:3, question_1c = c(2,3,4), question_2 = c(4,5,6), ques_tion_3 = c(3,2,4), 'trialOrder' = c("question_1c|question_2|ques_tion_3", "question_2|question_3|ques_tion_1c", "question_3|question_2|ques_tion_1c"))

finalDataSnippet = data.frame(id = c(rep(1,3),rep(2,3), rep(3,3)), question = rep(1:3,3), value = c(2,4,3,3,5,2,4,6,4), trialNumber = c(1,2,3,3,1,2,3,2,1))

I know how to turn the dataset into a long format using tidyr and pivot_longer, but I cannot figure out how to get the trial order variable into a useful format. Below is my truly awful attempt, which doesn't actually even return the right output. Even if it did, it would require that I manually specified each of the question names first, which is not really doable for 60+ questions. (Also, the nested for loops and continually use of c() really seem like they will get inefficient)

checks = c("1", "2", "3")
numberParticipants = 3
results = NA

for(item in 1:length(checks)){
  for(i in 1:numberParticipants){
    results = c(results, Position(function(x) grepl(checks[item],x), trial_orders[[i]]))
  }
}

Any help here would be really appreciated -- I (and presumably others given how common Qualtrics is) have to deal with this a lot.


Solution

  • We could use

    library(dplyr)
    library(tidyr)
    library(stringr)
    rawData %>% 
      select(-trialOrder) %>%
      pivot_longer(cols = starts_with('ques'), names_to = 'question') %>% 
      mutate(question = readr::parse_number(question)) %>% 
      mutate(trialNumber = unlist(str_extract_all(rawData$trialOrder, "\\d+")))
    

    It could be also

    rawData %>% 
        select(-trialOrder) %>%
        pivot_longer(cols = starts_with('ques'), names_to = 'question') %>% 
        mutate(question = readr::parse_number(question)) %>% 
        mutate(trialNumber = unlist(str_extract_all(rawData$trialOrder, "\\d+")))