Search code examples
rdplyrvectorizationsequencedummy-variable

Create dummy variables that are dependent on IDs following an ordered sequence


Here is my input:

structure(list(date = c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 
1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 
1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 
1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 
1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 
1990, 1991, 1992), member1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2), member2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), active1 = c(1, 
1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 
1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1), active2 = c(0, 1, 1, 0, 1, 
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 
1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 
1, 0, 1, 1, 0, 0, 1), group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
2, 2), task = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 
2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 
3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3)), class = "data.frame", row.names = c(NA, -54L))

Here is my desired output:

structure(list(date = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 1990L, 1991L, 1992L, 
1990L, 1991L, 1992L), member1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    member2 = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), active1 = c(1L, 
    1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
    0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
    1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 
    1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), active2 = c(0L, 1L, 1L, 
    0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 
    0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
    0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
    0L, 1L, 1L, 0L, 0L, 1L), group = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L), task = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
    1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 
    3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 
    2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
    ), dummy1 = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 
    0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
    0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), dummy2 = c(0L, 
    0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -54L))

What I want to do: I want to create two dummy variables, defined matehmatically as follows. I give a more verbal description below.

enter image description here

A kind of convoluted process. There are several groups of tasks, each task given by its identifier in the variable task, and belonging to a group of tasks as detailed in the variable group. Essentially, I want to make a dummy variable that is dependent on whether (member1, member2) follow the tasks in order in a given year.

For example, take group == 1 for (member1, member2) == (1, 2) (the first nine rows in the dput). For these two members, there are 3 tasks (task == {1, 2, 3}) in one group (group == 1) of tasks. active1 and active2 describe whether member1 and member2, respectively, are 'actively' doing the corresponding task in group in a given year. For task == 1 in year == 1990, only member1 is actively doing the task (active1 == 1, active2 == 0), but for the next two years, both members are actively doing the task (active1 == 1, active2 == 1), and so on.

For each task in each group, and for each pair of members, and in each year, I want to generate TWO dummy variables:

1.) one dummy variable equal to unity if a.) the task is the "first" task in the group (i.e., the task with the minimum number) and both member1 and member2 have active == 1 OR b.) if both member1 and member2 have active == 1 AND the task directly before it is also actively being performed. For example, for group == 1 for member1, member2, we'd have this dummy variable == 1 for all years for task == 1, this dummy variable == 1 for year == 1991 for task == 2, and this dummy variable == 1 for year == 1991 for task == 3. In other years for these latter two tasks, this dummy variable would equal 0.

2.) My second dummy variable that I want to create is essentially the opposite of the first variable. I want it to equal 1 if the task is a.) NOT the first task in the group but both member1 and member2 have active == 1, AND b.) the task directly before it is NOT actively being performed, with the dummy equalling zero otherwise. So for example, for group == 1 for member1, member2, this dummy variable would == 0 for task == 1 in all years, would == 0 for task == 2 in all years, and would == 1 for task == 3 in 1992; it would equal unity in this latter case because member1, member2 are actively performing task == 3 in 1992, but not performing task == 2 in that year.

PLEASE NOTE that the tasks are not ordered (1,2,3...) in my actual database. The tasks skip numbers (e.g, (10.0, 10.07, 11.0...)) but are still ordered, so a solution would either a.) have to avoid using the i, i+1 nomenclature or would first have to convert my task variable into an i, i+1 format.

THANK YOU FOR ANY HELP IN ADVANCE!

UPDATE: I received help and the accepted answer worked great for the dput---I've edited their code for my actual dataset, which may or may not help future answer-seekers:

  df1 <- df1 %>%
    mutate(lagtask=dplyr::lag(x=task, n = 1, order_by=grouping),
           lagact1=dplyr::lag(x=active1,1, order_by=grouping),
           lagact2=dplyr::lag(x=active2,1, order_by=grouping)) %>% 
    mutate(lagact1 = ifelse(is.na(lagact1), active1, lagact1),
           lagact2 = ifelse(is.na(lagact2), active2, lagact2)) %>%
      mutate(dummy1=ifelse(active1 == 1 & active2 == 1 &
                             # lag1 == lag(x=Sequence, 1) &
                             (lagact1 == 1 &
                             lagact2 ==1),
                           1,0
      )) %>%
    mutate(dummy2=ifelse(active1 == 1 & active2 == 1 &
                           # lag1 == lag(x=task, 1) &
                           (lagact1 != 1 |
                           lagact2 != 1),
                         1,0
    ))
  

Solution

  • I'm a little confused because the OP states that dummy2 should be 0 for task 1 and task 2 in all years for group 1 except it should equal 1 for task3 in 1992. However, the expected output shows dummy2 only equal to 1 for task 2 in 1991.

    desired_output[1:9,]
    

    Following the language of the question and not the values in the desired_output object (which was copied from the OP's post), I think the below works.

    output <- df %>%
      group_by(date, group) %>%
      mutate(dummy1 = ifelse(task == first(task) &
                               active1 == 1 &
                               active2 == 1,
                             1, 0)) %>%
      mutate(dummy1 = ifelse(active1 == 1 &
                               active2 == 1 &
                               lag(active1 == 1,
                                   default = 1) &
                               lag(active2 == 1,
                                   default = 1),
                             1, dummy1)) %>%
      mutate(dummy2 = ifelse(task != first(task) &
                               active1 == 1 &
                               active2 == 1,
                             1, 0)) %>%
      mutate(lagtask=lag(x=task, n = 1),
             lagact1=lag(x=active1,1),
             lagact2=lag(x=active2,1)) %>% 
      mutate(dummy2=ifelse(dummy2 == 1 &
                             # lag1 == lag(x=task, 1) &
                             lagact1 == 1 &
                             lagact2 ==1,
                           0,dummy2
      )) %>%
      ungroup() %>%
      as.data.frame() %>% 
      dplyr::select(!c(lagtask,lagact1,lagact2))
    

    UPDATE: More concise code without extra pipes. It was easier to see the steps using the pipes at first.

    output <- df %>%
      group_by(date, group) %>%
      mutate(dummy1 = ifelse(task == first(task) &
                               active1 == 1 &
                               active2 == 1,
                             1, 0),
             dummy1 = ifelse(active1 == 1 &
                               active2 == 1 &
                               lag(active1 == 1,
                                   default = 1) &
                               lag(active2 == 1,
                                   default = 1),
                             1, dummy1),
             dummy2 = ifelse(task != first(task) &
                               active1 == 1 &
                               active2 == 1,
                             1, 0),
             lagtask=lag(x=task, n = 1),
             lagact1=lag(x=active1,1),
             lagact2=lag(x=active2,1),
             dummy2=ifelse(dummy2 == 1 &
                             # lag1 == lag(x=task, 1) &
                             lagact1 == 1 &
                             lagact2 ==1,
                           0,dummy2)) %>%
      ungroup() %>%
      as.data.frame() %>% 
      dplyr::select(!c(lagtask,lagact1,lagact2))