I have a huge dataset (df) containing numerous columns listing information on number of people on treatment x (e.g. Treat_1) per country (ISO3) and per year (Year).
I need to generate a new column that is called "Flag_Treat_X" (where X stands for the original column name e.g Treat_1). In this column I need to record any treatment decline compared to the previous year as "down", increases as "up" and NA where there is missing data. All columns containing treatment information have "_" in their names, but the actual df uses complex names like HIV_treatment, etc instead of a combination of string and numbers as used here. I know several ways to do it column by column (see below) but the dataset is so huge and the variable names changes often that I need an automated way of doing this.
Here is the code for a simplified example:
ISO3 <- c("AFG", "AFG","AFG", "BEN", "BEN","BEN", "GIN", "GIN", "GIN", "ZWE", "ZWE", "ZWE")
Year <- c(2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022)
Treat_1 <- c(100, 110, 120, 300, 330, 360, 200, 220, 100, 300, NA, 320)
df = data.frame(ISO3, Year, Treat_1)
Here is an example of how I could do it column by column by hand
df_poutcome %>%
group_by(ISO3) %>%
mutate(Target_art_n_pf = case_when(Treat_1<lag(Treat_1) ~ "down",
Treat_1>lag(Treat_1) ~ "up",
TRUE ~ as.character(NA)))
Here is an example that "nearly" works using a loop (not very elegant)
for (i in grep("_", names(temp), value=TRUE)){
varname = ((gsub(" ", "", paste("Flag_",i))))
temp = temp %>%
group_by(ISO3) %>%
mutate(!!varname:= case_when(i<lag(i) ~ "down",
i>lag(i) ~ "up"))
}
The error is that I only get NA values for all the new "Flag_[...]" variables and the code
TRUE ~ as.character(NA)
does not run in the case_when code so I have to remove it.
The desired outcome should look as follows:
ISO3 | Year | Treat_1 | Flag_Treat_1 |
---|---|---|---|
AFG | 2020 | 100 | NA |
AFG | 2021 | 110 | up |
AFG | 2022 | 120 | up |
BEN | 2020 | 300 | NA |
BEN | 2021 | 330 | up |
BEN | 2022 | 360 | up |
GIN | 2020 | 200 | NA |
GIN | 2021 | 220 | up |
GIN | 2022 | 100 | down |
ZWE | 2020 | 300 | NA |
ZWE | 2021 | NA | NA |
ZWE | 2022 | 320 | NA |
Thank in advance for any help.
Thank you so much to everyone for the quick responses. I ended up using a blend of the responses to make it run on my side. Code below.
df %>%
group_by(ISO3) %>%
mutate(across(contains("_"),
~case_when(
.x < lag(.x) ~ "down",
.x > lag(.x) ~ "up",
TRUE ~ as.character(NA)
),
.names = "flag_{.col}"
))
I had to change the \(x)
suggested by Gregor Thomas to the code suggested by jdobres ~case_when( .x < lag(.x) ~ "down", .x > lag(.x) ~ "up", TRUE ~ as.character(NA) )
as it was giving me an unexpected symbol