Search code examples
rdplyrlaggroup

Create a dummy variable based on lagged combination of multiple other variables - R


Say I have the following data:

date employee company rolename company_conglomerate
31-12-2008 X A manager 1
31-12-2008 Y B manager 0
31-12-2008 Z C manager 0
31-12-2009 Y A CEO 1
31-12-2009 X B CEO 0
31-12-2009 Z C CEO 0

Based on the above data, I want to create a new dummy variable that is equal to 1 if a CEO has prior experience working at a conglomerate company, and 0 otherwise.

date employee company rolename company_conglomerate CEO_prior_conglomerate_experience
31-12-2008 X A manager 1 0
31-12-2008 Y B manager 0 0
31-12-2008 Z C manager 0 0
31-12-2009 Y A CEO 1 0
31-12-2009 X B CEO 0 1
31-12-2009 Z C CEO 0 0

Notice that employee X switches from company A to company B, and employee Y switches from company B to company A. Both were promoted, however, the new variable is only equal to 1 for employee X since they had prior experience at a conglomerate firm.

I can't seem to think of a solution or find my way around this problem. Any help here would be appreciated.


Solution

  • We can use lag(cumany(.)) after arranging by date to determine if an employee previously met the condition.

    library(dplyr)
    quux %>%
      mutate(date = as.Date(date, format = "%d-%m-%Y")) %>%
      arrange(date) %>%
      group_by(employee) %>%
      mutate(prior = (rolename == "CEO") & lag(cumany(company_conglomerate > 0))) %>%
      ungroup()
    # # A tibble: 6 × 6
    #   date       employee company rolename company_conglomerate prior
    #   <date>     <chr>    <chr>   <chr>                   <int> <lgl>
    # 1 2008-12-31 X        A       manager                     1 FALSE
    # 2 2008-12-31 Y        B       manager                     0 FALSE
    # 3 2008-12-31 Z        C       manager                     0 FALSE
    # 4 2009-12-31 Y        A       CEO                         1 FALSE
    # 5 2009-12-31 X        B       CEO                         0 TRUE 
    # 6 2009-12-31 Z        C       CEO                         0 FALSE
    

    Note that since we're ordering by date, we need to format it for better sorting.


    Data

    quux <- structure(list(date = c("31-12-2008", "31-12-2008", "31-12-2008", "31-12-2009", "31-12-2009", "31-12-2009"), employee = c("X", "Y", "Z", "Y", "X", "Z"), company = c("A", "B", "C", "A", "B", "C"), rolename = c("manager", "manager", "manager", "CEO", "CEO", "CEO"), company_conglomerate = c(1L, 0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))