
Recoding multiple factors using regexp


I have data from a survey, where several questions are in the format

"Do you think that [xxxxxxx]"

The possible answers to the questions are in the format

"I am certain that [xxxxxxx]" "I think it is possible that [xxxxxx]" "I don't know if [xxxxxx]"

and so on.

I would now like to recode these factors so that "I am certain" = 1, "I think it is possible" = 2 and so on. I have been playing with dplyr::recode but it does not seem to work with regular expressions.

For example:

set.seed(12345)

possible_answers <- c(
    "I am certain that", "I think it is possible that",
    "I don't know if is possible that", "I think it is not possible that",
    "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
    Q1 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 1"
    ),
    Q2 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 2"
    ),
    Q3 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 3"
    ),
    Q4 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 4"
    ),
    Q5 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 5"
    )
)

I can do something like

library(dplyr)

survey %>% 
    mutate_at("Q1", recode,
                "I am certain that topic 1" = 1,
                "I think it is possible that topic 1" = 2,
                "I don't know if is possible that topic 1" = 3,
                "I think it is not possible that topic 1" = 4,
                "I am certain that it is not possible that topic 1" = 5,
                "It is impossible for me to know if topic 1" = 6)

but doing it for all questions would be cumbersome.

I would like to do

survey %>% 
    mutate_at(vars(starts_with("Q")), recode,
                "I am certain that (.*)" = 1,
                "I think it is possible that (.*)" = 2,
                "I don't know if is possible that (.*)" = 3,
                "I think it is not possible that (.*)" = 4,
                "I am certain that it is not possible that (.*)" = 5,
                "It is impossible for me to know if (.*)" = 6)

But this changes everything to NA, because recode() matches values exactly and does not treat the strings as regular expressions.
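The difference between exact matching and regular-expression matching can be checked with a small base R sketch. The variable names are illustrative, and the equality test only stands in for the exact comparison that recode() performs; it is not a call into dplyr.

```r
# One answer as it appears in the data, and one of the patterns tried above.
answer  <- "I am certain that topic 1"
pattern <- "I am certain that (.*)"

# Exact comparison: the pattern is just another string, so nothing matches.
exact_match <- answer == pattern

# Regular-expression comparison: "(.*)" now matches the trailing topic text.
regex_match <- grepl(pattern, answer)

print(c(exact = exact_match, regex = regex_match))
```

Since none of the answers is literally equal to a pattern string, no replacement is applied to any value.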


Solution

  • I haven't tested this, but you should be able to use mutate(across(...)) with case_when() to do this, testing each answer with grepl(). Note that case_when() returns the value for the first condition that is TRUE, and "I am certain that" also matches "I am certain that it is not possible that", so you need to test the longer pattern first. That way the search for "I am certain that" only catches the positive cases.

    library(dplyr)

    survey %>% 
      mutate(across(starts_with("Q"), 
                    ~case_when(
                      grepl("I am certain that it is not possible that", .x) ~ 5,
                      grepl("I am certain that", .x) ~ 1, 
                      grepl("I think it is possible that", .x) ~ 2, 
                      grepl("I don't know if is possible that", .x) ~ 3, 
                      grepl("I think it is not possible that", .x) ~ 4,
                      grepl("It is impossible for me to know if", .x) ~ 6)))
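The same idea can be written without dplyr, as a rough sketch. It assumes every answer starts with one of the known prefixes. The helper name recode_prefix() and the lookup vector codes are made up for illustration and are not part of any package. Prefixes are tried longest first, so the ordering problem described above is handled automatically.

```r
# Hypothetical lookup table: answer prefix -> numeric code (values from the question).
codes <- c(
  "I am certain that" = 1,
  "I think it is possible that" = 2,
  "I don't know if is possible that" = 3,
  "I think it is not possible that" = 4,
  "I am certain that it is not possible that" = 5,
  "It is impossible for me to know if" = 6
)

# recode_prefix() is an illustrative helper, not a dplyr function.
recode_prefix <- function(x, codes) {
  x <- as.character(x)
  # Longest prefixes first, so the "not possible" variant wins over "I am certain that".
  codes <- codes[order(nchar(names(codes)), decreasing = TRUE)]
  out <- rep(NA_real_, length(x))
  for (prefix in names(codes)) {
    # Only fill entries that no longer prefix has already claimed.
    hit <- is.na(out) & !is.na(x) & startsWith(x, prefix)
    out[hit] <- codes[[prefix]]
  }
  out
}

answers <- c(
  "I am certain that topic 1",
  "I am certain that it is not possible that topic 1",
  "I don't know if is possible that topic 1",
  "Something else entirely"
)
recoded <- recode_prefix(answers, codes)
print(recoded)
```

To apply it to the survey columns, something like `survey[] <- lapply(survey, recode_prefix, codes = codes)` should work when every column is a question. Answers that match no prefix are left as NA.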