I have a large dataset of substance use data, measuring daily use in the past week. I am trying to write a function that will process it easily. I have two steps I need to accomplish:
Current the data is labeled as "sub1", "sub2", "sub3"...all the way to 6. And there is a corresponding "value" variable (e.g. "sub1_value" = Alcohol). And then there are sub1_day1..sub1_day2...etc etc with a number indicating amount of use. I want to create new variables that are based on the substance (e.g. day1_alcohol, day2_alcohol, day1_cocaine etc...)
I don't want to have to copy and paste and edit this 15 different times. I'm trying to develop a function that will do this for me.
I've set up a reprex with 2 substances per person, 4 possible substances (c("Alcohol", "Cocaine", "Opiates", "Cannabis")), 3 days of use.
#Sample Data:
df <-
data.frame(
sub1_value = c("Alcohol", "Alcohol", "Cocaine", "Opiates", "Cannabis"),
sub1_day1 = c(4, 3, 1, 0, 1),
sub1_day2 = c(4, 7, 1, 0, 0),
sub1_day3 = c(5, 6, 0, 1, 1),
sub2_value = c("Cannabis", "Opiates", "Alcohol", "Cocaine", "Alcohol"),
sub2_day1 = c(7, 2, 0, 0, 0),
sub2_day2 = c(3, 2, 1, 1, 1),
sub2_day3 = c(9, 8, 0, 1, 1)
)
This code gets me to process "sub1" and "sub2": - Is there a more efficient way to write this?
df <- df %>%
mutate(
day1_alc = if_else(
sub1_value == "Alcohol",
sub1_day1,
if_else(sub2_value == "Alcohol", sub2_day1,
NA)
),
day2_alc = if_else(
sub1_value == "Alcohol",
sub1_day2,
if_else(sub2_value == "Alcohol", sub2_day2,
NA)
),
day3_alc = if_else(
sub1_value == "Alcohol",
sub1_day3,
if_else(sub2_value == "Alcohol", sub2_day3,
NA)
)
)
And question two - how to write a function that will do this for all days and all substances. As I mentioned I have a lot of days and substances so would like to minimize effort as much as possible.
I'm expecting a data set that keeps the original data files but then also has variables day1_alc, day2_alc, day3_alc day1_cocaine, day2_cocaine...etc that copies over the value from "sub1" to the appropriate new variable.
Any help with guidance on the if statement or the function is appreciate!
Edits: I figured out one solution - hoping for help on the function part.
Pivoting to long format may look like this:
library(dplyr)
library(tidyr) # pivot_*
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day"))
# # A tibble: 15 × 6
# sub1_value sub2_value rn day sub1 sub2
# <chr> <chr> <int> <chr> <dbl> <dbl>
# 1 Alcohol Cannabis 1 day1 4 7
# 2 Alcohol Cannabis 1 day2 4 3
# 3 Alcohol Cannabis 1 day3 5 9
# 4 Alcohol Opiates 2 day1 3 2
# 5 Alcohol Opiates 2 day2 7 2
# 6 Alcohol Opiates 2 day3 6 8
# 7 Cocaine Alcohol 3 day1 1 0
# 8 Cocaine Alcohol 3 day2 1 1
# 9 Cocaine Alcohol 3 day3 0 0
# 10 Opiates Cocaine 4 day1 0 0
# 11 Opiates Cocaine 4 day2 0 1
# 12 Opiates Cocaine 4 day3 1 1
# 13 Cannabis Alcohol 5 day1 1 0
# 14 Cannabis Alcohol 5 day2 0 1
# 15 Cannabis Alcohol 5 day3 1 1
From here, we can use case_when
to determine the alc
column:
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
mutate(
alc = case_when(
sub1_value == "Alcohol" ~ sub1,
sub2_value == "Alcohol" ~ sub2,
.default = NA)
)
# # A tibble: 15 × 7
# sub1_value sub2_value rn day sub1 sub2 alc
# <chr> <chr> <int> <chr> <dbl> <dbl> <dbl>
# 1 Alcohol Cannabis 1 day1 4 7 4
# 2 Alcohol Cannabis 1 day2 4 3 4
# 3 Alcohol Cannabis 1 day3 5 9 5
# 4 Alcohol Opiates 2 day1 3 2 3
# 5 Alcohol Opiates 2 day2 7 2 7
# 6 Alcohol Opiates 2 day3 6 8 6
# 7 Cocaine Alcohol 3 day1 1 0 0
# 8 Cocaine Alcohol 3 day2 1 1 1
# 9 Cocaine Alcohol 3 day3 0 0 0
# 10 Opiates Cocaine 4 day1 0 0 NA
# 11 Opiates Cocaine 4 day2 0 1 NA
# 12 Opiates Cocaine 4 day3 1 1 NA
# 13 Cannabis Alcohol 5 day1 1 0 0
# 14 Cannabis Alcohol 5 day2 0 1 1
# 15 Cannabis Alcohol 5 day3 1 1 1
and if you need to have it back in wide format (I do not recommend it, but just in case):
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn, sub1_value, sub2_value), names_pattern = "(.*)_(.*)", names_to = c(".value", "day")) %>%
mutate(alc = case_when(sub1_value == "Alcohol" ~ sub1, sub2_value == "Alcohol" ~ sub2, .default = NA)) %>%
pivot_wider(id_cols = c(rn, sub1_value, sub2_value), names_from = day, values_from = c(sub1, sub2, alc))
# # A tibble: 5 × 12
# rn sub1_value sub2_value sub1_day1 sub1_day2 sub1_day3 sub2_day1 sub2_day2 sub2_day3 alc_day1 alc_day2 alc_day3
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 Alcohol Cannabis 4 4 5 7 3 9 4 4 5
# 2 2 Alcohol Opiates 3 7 6 2 2 8 3 7 6
# 3 3 Cocaine Alcohol 1 1 0 0 1 0 0 1 0
# 4 4 Opiates Cocaine 0 0 1 0 1 1 NA NA NA
# 5 5 Cannabis Alcohol 1 0 1 0 1 1 0 1 1
The use of rn
is only required if you need to bring it back to the original wide format. If you have another field not included that is a good id
-like field, it might be more appropriate, though I don't think rn
goes wrong here.