Search code examples
rmutaterowwise

creating variable from rowwise operation depending on column values


sample data

# Set seed for reproducibility
set.seed(123)

# Create a sample dataframe with 100 observations
n_obs <- 100
df <- data.frame(
  serial_id = 1:n_obs,
  code_1 = sample(c("yes", "no"), n_obs, replace = TRUE),
  code_2 = sample(c("yes", "no"), n_obs, replace = TRUE),
  code_3 = sample(c("yes", "no"), n_obs, replace = TRUE),
  type_1 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE),
  type_2 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE),
  type_3 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE)
)

I am trying to create a variable that satisfies the following logic:

  1. For each row: if any of the code_* columns have "yes", AND the corresponding type_* column (For example: code_1 corresponds with type_1 and so on) have "A", the new variable takes in "1".
  2. For each row: if any of the code_* columns have "yes", AND the corresponding type_* column (For example: code_1 corresponds with type_1 and so on) have "B", the new variable takes in "0". This rule overrides all the previous rules, even if there there is a logic of a combination of "yes" and "A" that should have resulted in a "1"
  3. For each row: If there is "no" for corresponding code_* of both "B" and "C" in any of the type_* columns and a "yes" in a code_* which has a "C" in it's corresponding type_x, then new_var == 1

I could not figure out, how get the names of the corresponding column based on the last character (1,2,3,4,5,6.......), and then performing a rowwise operation taking into columns of that isolated row. The original data has such about 20 of such pair of code_* and type_*. So, I am trying to come up with something iterable.


Solution

  • Convert to long, cast to a wider format with the code and type paired, apply the rules by serial_id with any(), and then left join the new variable back to the original data set on serial_id. This should work for as many code/type pairs as you have.

    # Set seed for reproducibility
    set.seed(123)
    
    # Create a sample dataframe with 100 observations
    library(data.table)
    n_obs <- 100
    df <- data.table(
      serial_id = 1:n_obs,
      code_1 = sample(c("yes", "no"), n_obs, replace = TRUE),
      code_2 = sample(c("yes", "no"), n_obs, replace = TRUE),
      code_3 = sample(c("yes", "no"), n_obs, replace = TRUE),
      type_1 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE),
      type_2 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE),
      type_3 = sample(c("A", "B", "C", "D"), n_obs, replace = TRUE)
    )
    
    # Convert to long format
    df_long <- melt(df, id.vars = 'serial_id')
    df_long[, type := tstrsplit(variable, "_", keep = 1)]
    df_long[, index := tstrsplit(variable, "_", keep = 2)]
    
    # Cast to a paired wide format.
    df_wide <- dcast(df_long, serial_id + index ~ type,
                     value.var = "value")
    
    # Apply rules
    # Rule 1) Any yes + A => new_var = 1.
    df_wide[, new_var := ifelse(any(code == "yes" & type == "A"), 1, NA_real_),
            by = serial_id]
    
    # Rule 2) Any yes + B => new_var = 0, Overwrite rule 1, not rule 2
    df_wide[, new_var := ifelse(any(code == "yes" & type == "B"), 0, new_var),
            by = serial_id]
    
    # Rule 3) all(no + B, no + C, yes + C) => new_var = 1.  Overwrite rule 2.
    # Example, serial_id = 38.
    df_wide[, new_var := 
              ifelse(any(code == "no" & type == "B") &
                       any(code == "no" & type == "C") &
                       any(code == "yes" & type == "C"), 1, new_var),
            by = serial_id]
    
    # Join back to the original data frame
    df[df_wide[, .SD[1], by = serial_id], 
       new_var := new_var, on = .(serial_id)]