Search code examples
rdplyrdata-cleaning

Writing a function to process multiple if else recoding statements in R


I have a large dataset of substance use data, measuring daily use in the past week. I am trying to write a function that will process it easily. I have two steps I need to accomplish:

  1. Write a recode function.

Current the data is labeled as "sub1", "sub2", "sub3"...all the way to 6. And there is a corresponding "value" variable (e.g. "sub1_value" = Alcohol). And then there are sub1_day1..sub1_day2...etc etc with a number indicating amount of use. I want to create new variables that are based on the substance (e.g. day1_alcohol, day2_alcohol, day1_cocaine etc...)

  1. Write a function that does this for 6 different "sub1/2/3" variables and 15 possible substances.

I don't want to have to copy and paste and edit this 15 different times. I'm trying to develop a function that will do this for me.

I've set up a reprex with 2 substances per person, 4 possible substances (c("Alcohol", "Cocaine", "Opiates", "Cannabis")), 3 days of use.

#Sample Data:

df <-
  data.frame(
    sub1_value = c("Alcohol", "Alcohol", "Cocaine", "Opiates", "Cannabis"),
    sub1_day1 = c(4, 3, 1, 0, 1),
    sub1_day2 = c(4, 7, 1, 0, 0),
    sub1_day3 = c(5, 6, 0, 1, 1),
    sub2_value = c("Cannabis", "Opiates", "Alcohol", "Cocaine", "Alcohol"),
    sub2_day1 = c(7, 2, 0, 0, 0),
    sub2_day2 = c(3, 2, 1, 1, 1),
    sub2_day3 = c(9, 8, 0, 1, 1)
  )

This code gets me to process "sub1" and "sub2": - Is there a more efficient way to write this?


    df <- df %>%
      mutate(
        day1_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day1,
          if_else(sub2_value == "Alcohol", sub2_day1,
                  NA)
        ),
        day2_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day2,
          if_else(sub2_value == "Alcohol", sub2_day2,
                  NA)
        ),
        day3_alc = if_else(
          sub1_value == "Alcohol",
          sub1_day3,
          if_else(sub2_value == "Alcohol", sub2_day3,
                  NA)
        )
      )

And question two - how to write a function that will do this for all days and all substances. As I mentioned I have a lot of days and substances so would like to minimize effort as much as possible.

I'm expecting a data set that keeps the original data files but then also has variables day1_alc, day2_alc, day3_alc day1_cocaine, day2_cocaine...etc that copies over the value from "sub1" to the appropriate new variable.

Any help with guidance on the if statement or the function is appreciate!

Edits: I figured out one solution - hoping for help on the function part.


Solution

  • Pivoting to long format may look like this:

    library(dplyr)
    library(tidyr) # pivot_*
    df %>%
      mutate(rn = row_number()) %>%
      pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day"))
    # # A tibble: 15 × 6
    #    sub1_value sub2_value    rn day    sub1  sub2
    #    <chr>      <chr>      <int> <chr> <dbl> <dbl>
    #  1 Alcohol    Cannabis       1 day1      4     7
    #  2 Alcohol    Cannabis       1 day2      4     3
    #  3 Alcohol    Cannabis       1 day3      5     9
    #  4 Alcohol    Opiates        2 day1      3     2
    #  5 Alcohol    Opiates        2 day2      7     2
    #  6 Alcohol    Opiates        2 day3      6     8
    #  7 Cocaine    Alcohol        3 day1      1     0
    #  8 Cocaine    Alcohol        3 day2      1     1
    #  9 Cocaine    Alcohol        3 day3      0     0
    # 10 Opiates    Cocaine        4 day1      0     0
    # 11 Opiates    Cocaine        4 day2      0     1
    # 12 Opiates    Cocaine        4 day3      1     1
    # 13 Cannabis   Alcohol        5 day1      1     0
    # 14 Cannabis   Alcohol        5 day2      0     1
    # 15 Cannabis   Alcohol        5 day3      1     1
    

    From here, we can use case_when to determine the alc column:

    df %>%
      mutate(rn = row_number()) %>%
      pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
      mutate(
        alc = case_when(
          sub1_value == "Alcohol" ~ sub1, 
          sub2_value == "Alcohol" ~ sub2, 
          .default = NA)
      )
    # # A tibble: 15 × 7
    #    sub1_value sub2_value    rn day    sub1  sub2   alc
    #    <chr>      <chr>      <int> <chr> <dbl> <dbl> <dbl>
    #  1 Alcohol    Cannabis       1 day1      4     7     4
    #  2 Alcohol    Cannabis       1 day2      4     3     4
    #  3 Alcohol    Cannabis       1 day3      5     9     5
    #  4 Alcohol    Opiates        2 day1      3     2     3
    #  5 Alcohol    Opiates        2 day2      7     2     7
    #  6 Alcohol    Opiates        2 day3      6     8     6
    #  7 Cocaine    Alcohol        3 day1      1     0     0
    #  8 Cocaine    Alcohol        3 day2      1     1     1
    #  9 Cocaine    Alcohol        3 day3      0     0     0
    # 10 Opiates    Cocaine        4 day1      0     0    NA
    # 11 Opiates    Cocaine        4 day2      0     1    NA
    # 12 Opiates    Cocaine        4 day3      1     1    NA
    # 13 Cannabis   Alcohol        5 day1      1     0     0
    # 14 Cannabis   Alcohol        5 day2      0     1     1
    # 15 Cannabis   Alcohol        5 day3      1     1     1
    

    and if you need to have it back in wide format (I do not recommend it, but just in case):

    df %>%
      mutate(rn = row_number()) %>%
      pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
      mutate(alc = case_when(sub1_value == "Alcohol" ~ sub1, sub2_value == "Alcohol" ~ sub2, .default = NA)) %>%
      pivot_wider(id_cols = c(rn, sub1_value, sub2_value), names_from = day, values_from = c(sub1, sub2, alc))
    # # A tibble: 5 × 12
    #      rn sub1_value sub2_value sub1_day1 sub1_day2 sub1_day3 sub2_day1 sub2_day2 sub2_day3 alc_day1 alc_day2 alc_day3
    #   <int> <chr>      <chr>          <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
    # 1     1 Alcohol    Cannabis           4         4         5         7         3         9        4        4        5
    # 2     2 Alcohol    Opiates            3         7         6         2         2         8        3        7        6
    # 3     3 Cocaine    Alcohol            1         1         0         0         1         0        0        1        0
    # 4     4 Opiates    Cocaine            0         0         1         0         1         1       NA       NA       NA
    # 5     5 Cannabis   Alcohol            1         0         1         0         1         1        0        1        1
    

    The use of rn is only required if you need to bring it back to the original wide format. If you have another field not included that is a good id-like field, it might be more appropriate, though I don't think rn goes wrong here.