Recode matrix values with dplyr based on look-up table

I have a matrix with numerous cases and corresponding answers from a questionnaire. The strongly reduced example below (raw_responses) contains the answers of 5 persons to 5 items. Let us assume that these are multiple choice items with 4 possible answers each. If the item was not processed, the person received the code 9.

raw_responses <- data.frame('id' = 1:10,
                            'item_1' =  sample(c(1:4,9), 10, replace = TRUE),
                            'item_2' =  sample(c(1:4,9), 10, replace = TRUE),
                            'item_3' =  sample(c(1:4,9), 10, replace = TRUE),
                            'item_4' =  sample(c(1:4,9), 10, replace = TRUE),
                            'item_5' =  sample(c(1:4,9), 10, replace = TRUE))

The correct answers are stored in a separate table that reflects the entire test design. Below again a strongly reduced variant (design) with only item names and the corresponding correct answers.

design <- data.frame('item' = c('item_1','item_2','item_3','item_4','item_5'),
                     'key' = sample(1:4, 5, replace = TRUE))

Finally, the goal is a table with scored answers. A correct answer is coded with 1, a wrong one with 0 and an "empty" answer with 99. This works for example with the for loop below.

scored_responses <- raw_responses
for(item in colnames(raw_responses)[2:6]) {
  scored_responses[, item] <- ifelse(scored_responses[, item] == design[design$item == item, 'key'], 1, 
                                     ifelse(scored_responses[, item] == 9, 99, 0))
}

However, I was wondering if this would work with a more efficient variant with dplyr (including case_when) and possibly purr. Especially because the very extensive answer table is cleaned up with the help of a longer dplyr-pipe, it would be an advantage if the scoring could be built in there.

I thank you in advance for all ideas and hints.

Solution

Get the data in long format, join, recode the values and get the data back in wide format.

library(dplyr)
library(tidyr)

raw_responses %>%
  pivot_longer(cols = -id, names_to = 'item') %>%
  left_join(design, by = 'item') %>%
  mutate(value = case_when(value == 9 ~ 99,
                           value == key ~ 1, 
                           TRUE ~ 0)) %>%
  select(-key) %>%
  pivot_wider(names_from = 'item')

# A tibble: 10 x 6
#      id item_1 item_2 item_3 item_4 item_5
#   <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1     1     99     99      1      0      0
# 2     2     99     99     99      0      0
# 3     3      1     99      0     99     99
# 4     4      0      1      1     99      1
# 5     5     99      0      1      0      1
# 6     6      0      1      0      0      1
# 7     7      0      0      0      1     99
# 8     8      1     99      0      0      0
# 9     9      0     99     99      0      1
#10    10     99      1     99      1      0

Another approach without getting data into wide format is to use map2_dfc from purrr :

library(purrr)
map2_dfc(raw_responses[-1], design$key, ~case_when(.x == 9 ~ 99,
                                                   .x == .y ~ 1, 
                                                   TRUE ~ 0))

However, for this answer to work we need to ensure that column names in raw_responses and design$item are in the same order. In this example, they are already in the same order however, in the real data if they are not we can achieve it by doing :

raw_responses[-1] <- raw_responses[-1][design$key]