r statistics recode

R: What is an efficient way to recode variables? How do I prorate means?


I was wondering if anyone could point me in the direction of how I would go about recoding multiple variables with the same rules. I have the following df bhs1:

structure(list(bhs1_1 = c(NA, 1, NA, 2, 1, 2), bhs1_2 = c(NA, 
2, NA, 2, 1, 1), bhs1_3 = c(NA, 1, NA, 2, 2, 2), bhs1_4 = c(NA, 
2, NA, 1, 1, 1), bhs1_5 = c(NA, 1, NA, 1, 2, 2), bhs1_6 = c(NA, 
1, NA, 2, 1, 2), bhs1_7 = c(NA, 1, NA, 1, 2, 1), bhs1_8 = c(NA, 
2, NA, 2, 2, 2), bhs1_9 = c(NA, 1, NA, 2, 1, 1), bhs1_10 = c(NA, 
2, NA, 1, 2, 2), bhs1_11 = c(NA, 2, NA, 2, 2, 1), bhs1_12 = c(NA, 
2, NA, 2, 1, 1), bhs1_13 = c(NA, 1, NA, 1, 2, 2), bhs1_14 = c(NA, 
2, NA, 2, 1, 1), bhs1_15 = c(NA, 1, NA, 2, 2, 2), bhs1_16 = c(NA, 
2, NA, 2, 2, 2), bhs1_17 = c(NA, 2, NA, 2, 2, 1), bhs1_18 = c(NA, 
1, NA, 1, 2, 1), bhs1_19 = c(NA, 1, NA, 2, 1, 2), bhs1_20 = c(NA, 
2, NA, 2, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame")) 

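There are two transformation rules, each applying to about half of the columns. For these columns, a response of 1 should become 1 and any other response 0:

(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17, 
bhs1_18, bhs1_20) 
(if_else(x == 1, 1, 0))

and for these columns, a response of 2 should become 1 and any other response 0:

(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13, 
bhs1_15, bhs1_19)
(if_else(x == 2, 1, 0))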

Is there an elegant way to write code to meet this use case? If so, can someone please point me in the right direction and/or provide me with a sample?


Solution

  • Here's a solution using dplyr

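    library(dplyr)

    # Columns scored 1 when the response is 1
    case1 <- vars(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17, 
      bhs1_18, bhs1_20) 
    # Columns scored 1 when the response is 2
    case2 <- vars(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13, 
      bhs1_15, bhs1_19)
    # The comparison yields TRUE/FALSE/NA; multiplying by 1L turns that into 1/0/NA
    result <- bhs1 %>%
      mutate_at(case1, ~ (. == 1) * 1L) %>%
      mutate_at(case2, ~ (. == 2) * 1L)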
    

    Note: I skipped the ifelse statement. I just test for your condition, then convert the TRUE/FALSE results to numbers by multiplying by 1. I'm also not sure how you want NAs to be handled; with this code they are left as NA, because comparing NA with a number returns NA.
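
    To make the NA behaviour concrete, here is a minimal base R sketch on a made-up vector `x` (not taken from bhs1). The second line is only one possible choice, assuming you would rather score missing responses as 0 than keep them missing:

    ```r
    # Illustrative vector: one missing value, one 1, one 2
    x <- c(NA, 1, 2)

    # Comparing NA with a number gives NA, and NA * 1L is still NA,
    # so the missing value stays missing after the recode
    kept_na <- (x == 1) * 1L

    # %in% never returns NA, so the missing value is scored as 0 here
    na_as_zero <- as.integer(x %in% 1)
    ```

    Which of the two is appropriate depends on how the scale is meant to be scored, so treat the second form as an option rather than a recommendation.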

    If you aren't familiar with the pipe operator (%>%), it takes the result of the previous function, and sets it as the first argument of the next function. It's designed to improve code legibility by avoiding lots of function nesting.
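
    As a small illustration of the pipe, the sketch below writes the same two-step recode once as nested calls and once piped. The data frame `toy` is a made-up stand-in for bhs1 with one column from each rule. It also uses `across()`, which replaced `mutate_at()` as the recommended form in dplyr 1.0.0; that swap is my assumption about what you have installed, not part of the original answer.

    ```r
    library(dplyr)

    # Toy stand-in for bhs1: one column per rule, values made up for illustration
    toy <- tibble(bhs1_1 = c(NA, 1, 2), bhs1_2 = c(NA, 2, 1))

    # Nested form: the innermost call runs first, so it reads inside-out
    nested <- mutate(
      mutate(toy, across(bhs1_2, ~ as.integer(.x == 1))),
      across(bhs1_1, ~ as.integer(.x == 2))
    )

    # Piped form: each result becomes the first argument of the next call
    piped <- toy %>%
      mutate(across(bhs1_2, ~ as.integer(.x == 1))) %>%
      mutate(across(bhs1_1, ~ as.integer(.x == 2)))

    # Both forms should produce the same tibble
    same_result <- identical(nested, piped)
    ```

    With the full data, the single column names inside `across()` would be replaced by `c(bhs1_2, bhs1_4, ...)` for each rule.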