Writing a function to process multiple if else recoding statements in R

I have a large dataset of substance use data, measuring daily use in the past week. I am trying to write a function that will process it easily. I have two steps I need to accomplish:

Write a recode function.

Current the data is labeled as "sub1", "sub2", "sub3"...all the way to 6. And there is a corresponding "value" variable (e.g. "sub1_value" = Alcohol). And then there are sub1_day1..sub1_day2...etc etc with a number indicating amount of use. I want to create new variables that are based on the substance (e.g. day1_alcohol, day2_alcohol, day1_cocaine etc...)

Write a function that does this for 6 different "sub1/2/3" variables and 15 possible substances.

I don't want to have to copy and paste and edit this 15 different times. I'm trying to develop a function that will do this for me.

I've set up a reprex with 2 substances per person, 4 possible substances (c("Alcohol", "Cocaine", "Opiates", "Cannabis")), 3 days of use.

#Sample Data:

df <-
  data.frame(
    sub1_value = c("Alcohol", "Alcohol", "Cocaine", "Opiates", "Cannabis"),
    sub1_day1 = c(4, 3, 1, 0, 1),
    sub1_day2 = c(4, 7, 1, 0, 0),
    sub1_day3 = c(5, 6, 0, 1, 1),
    sub2_value = c("Cannabis", "Opiates", "Alcohol", "Cocaine", "Alcohol"),
    sub2_day1 = c(7, 2, 0, 0, 0),
    sub2_day2 = c(3, 2, 1, 1, 1),
    sub2_day3 = c(9, 8, 0, 1, 1)
  )

This code gets me to process "sub1" and "sub2": - Is there a more efficient way to write this?


    df <- df %>%
      mutate(
        day1_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day1,
          if_else(sub2_value == "Alcohol", sub2_day1,
                  NA)
        ),
        day2_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day2,
          if_else(sub2_value == "Alcohol", sub2_day2,
                  NA)
        ),
        day3_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day3,
          if_else(sub2_value == "Alcohol", sub2_day3,
                  NA)
        )
      )

And question two - how to write a function that will do this for all days and all substances. As I mentioned I have a lot of days and substances so would like to minimize effort as much as possible.

I'm expecting a data set that keeps the original data files but then also has variables day1_alc, day2_alc, day3_alc day1_cocaine, day2_cocaine...etc that copies over the value from "sub1" to the appropriate new variable.

Any help with guidance on the if statement or the function is appreciate!

Edits: I figured out one solution - hoping for help on the function part.

Solution

Pivoting to long format may look like this:

library(dplyr)
library(tidyr) # pivot_*
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day"))
# # A tibble: 15 × 6
#    sub1_value sub2_value    rn day    sub1  sub2
#    <chr>      <chr>      <int> <chr> <dbl> <dbl>
#  1 Alcohol    Cannabis       1 day1      4     7
#  2 Alcohol    Cannabis       1 day2      4     3
#  3 Alcohol    Cannabis       1 day3      5     9
#  4 Alcohol    Opiates        2 day1      3     2
#  5 Alcohol    Opiates        2 day2      7     2
#  6 Alcohol    Opiates        2 day3      6     8
#  7 Cocaine    Alcohol        3 day1      1     0
#  8 Cocaine    Alcohol        3 day2      1     1
#  9 Cocaine    Alcohol        3 day3      0     0
# 10 Opiates    Cocaine        4 day1      0     0
# 11 Opiates    Cocaine        4 day2      0     1
# 12 Opiates    Cocaine        4 day3      1     1
# 13 Cannabis   Alcohol        5 day1      1     0
# 14 Cannabis   Alcohol        5 day2      0     1
# 15 Cannabis   Alcohol        5 day3      1     1

From here, we can use case_when to determine the alc column:

df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
  mutate(
    alc = case_when(
      sub1_value == "Alcohol" ~ sub1, 
      sub2_value == "Alcohol" ~ sub2, 
      .default = NA)
  )
# # A tibble: 15 × 7
#    sub1_value sub2_value    rn day    sub1  sub2   alc
#    <chr>      <chr>      <int> <chr> <dbl> <dbl> <dbl>
#  1 Alcohol    Cannabis       1 day1      4     7     4
#  2 Alcohol    Cannabis       1 day2      4     3     4
#  3 Alcohol    Cannabis       1 day3      5     9     5
#  4 Alcohol    Opiates        2 day1      3     2     3
#  5 Alcohol    Opiates        2 day2      7     2     7
#  6 Alcohol    Opiates        2 day3      6     8     6
#  7 Cocaine    Alcohol        3 day1      1     0     0
#  8 Cocaine    Alcohol        3 day2      1     1     1
#  9 Cocaine    Alcohol        3 day3      0     0     0
# 10 Opiates    Cocaine        4 day1      0     0    NA
# 11 Opiates    Cocaine        4 day2      0     1    NA
# 12 Opiates    Cocaine        4 day3      1     1    NA
# 13 Cannabis   Alcohol        5 day1      1     0     0
# 14 Cannabis   Alcohol        5 day2      0     1     1
# 15 Cannabis   Alcohol        5 day3      1     1     1

and if you need to have it back in wide format (I do not recommend it, but just in case):

df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
  mutate(alc = case_when(sub1_value == "Alcohol" ~ sub1, sub2_value == "Alcohol" ~ sub2, .default = NA)) %>%
  pivot_wider(id_cols = c(rn, sub1_value, sub2_value), names_from = day, values_from = c(sub1, sub2, alc))
# # A tibble: 5 × 12
#      rn sub1_value sub2_value sub1_day1 sub1_day2 sub1_day3 sub2_day1 sub2_day2 sub2_day3 alc_day1 alc_day2 alc_day3
#   <int> <chr>      <chr>          <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
# 1     1 Alcohol    Cannabis           4         4         5         7         3         9        4        4        5
# 2     2 Alcohol    Opiates            3         7         6         2         2         8        3        7        6
# 3     3 Cocaine    Alcohol            1         1         0         0         1         0        0        1        0
# 4     4 Opiates    Cocaine            0         0         1         0         1         1       NA       NA       NA
# 5     5 Cannabis   Alcohol            1         0         1         0         1         1        0        1        1

The use of rn is only required if you need to bring it back to the original wide format. If you have another field not included that is a good id-like field, it might be more appropriate, though I don't think rn goes wrong here.