I have a large dataset for which I want to create 50 new variables where the values are conditional on values in previous columns, and the name of the variables reflect this fact. To make it more intelligible, here is an example:
df <- tibble("a" = runif(10,1990,2000),
"event" = 1995) %>%
mutate("relative_event" = a - event)
Now with this dataset I would like to create dummy variables that code if the specific observation is one year prior to the event, 2 year prior, etc, as well as forward. One clumsy way to do this (which works) is:
df <- df %>%
mutate("event_b1" = ifelse( (relative_event<=0) & (relative_event > -1),1,0)) %>%
mutate("event_b2" = ifelse( (relative_event<=-1) & (relative_event > -2),1,0)) %>% #etc with more lagx
mutate("event_f1" = ifelse( (relative_event>0) & (relative_event < 1),1,0)) %>%
mutate("event_f2" = ifelse( (relative_event>1) & (relative_event < 2 ),1,0)) #etc with more forward
where b1 is for "one year before" and f2 is for "2 years forward". The result looks like this:
A tibble: 10 x 7
a event relative_event event_b1 event_b2 event_f1 event_f2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1993. 1995 -1.94 0 1 0 0
2 1992. 1995 -2.59 0 0 0 0
3 2000. 1995 4.75 0 0 0 0
4 1998. 1995 3.25 0 0 0 0
5 1991. 1995 -3.88 0 0 0 0
6 1992. 1995 -3.02 0 0 0 0
7 1996. 1995 1.08 0 0 0 1
8 1994. 1995 -1.04 0 1 0 0
9 1993. 1995 -2.22 0 0 0 0
10 1995. 1995 -0.302 1 0 0 0
Since I have more than 50 columns to create I would like to know how to do it automatically so that I don't have to copy-paste 49 times and manually change the condition and the variable name. I spent time looking on SO on this thread, this one and on CV as well but I am still clueless. I tried the following code which does not work:
for (i in 0:10) {
if (i<0) {
event_bi <- paste0("event_b",i)
df <- df %>%
mutate(get(event_bi) = ifelse((relative_event<=-(i-1)) & (relative_event>-i),1,0))
}
}
Ideally I'd like to learn how to do it with dplyr but if there is an obvious Base R solution I'm happy to learn it as well.
Thanks!
Although I prefer a solution with all variables in one column as suggested by @Patrick (although I would use something like %>% mutate(new_col = case_when(etc...))
, here a way with for-loop
# I changed your data a tiny bit
df <- tibble("a" = sample(1990:2000, size = 10), # better to use 'sample' then 'runif' !
"event" = 1995) %>% mutate("relative_event" = a - event)
Now the actual work
for (i in min(df$relative_event):max(df$relative_event)) {
# the indexing value is your difference in years. So you have to run the index from the lowest difference to the highest.
if( i < 0 ) {
df[[paste0('event_b', abs(i))]] <- ifelse(i == df$relative_event, 1, 0)
}
if( i >= 0 ) {
df[[paste0('event_f', abs(i))]] <- ifelse(i == df$relative_event, 1, 0)
df
}
}
# A tibble: 10 x 14
a event relative_event event_b5 event_b4 event_b3 event_b2 event_b1
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 1995 -5 1 0 0 0 0
2 1992 1995 -3 0 0 1 0 0
3 1991 1995 -4 0 1 0 0 0
4 2000 1995 5 0 0 0 0 0
5 1998 1995 3 0 0 0 0 0
6 1993 1995 -2 0 0 0 1 0
7 1996 1995 1 0 0 0 0 0
8 1997 1995 2 0 0 0 0 0
9 1994 1995 -1 0 0 0 0 1
10 1999 1995 4 0 0 0 0 0
# ... with 6 more variables: event_f0 <dbl>, event_f1 <dbl>, event_f2 <dbl>,
# event_f3 <dbl>, event_f4 <dbl>, event_f5 <dbl>
If you don't want to run through every possible difference in years - (this will create 'empty' columns) - you could simply create a vector with unique(df$relative_event)
and run i
through this vector