Search code examples
rdplyrdata-transform

r generate indicator columns based on conditions


Suppose this is my dataset

State     Enter Event  Lag 
State-A   2000  2004  -4 
State-A   2001  2004  -3
State-A   2002  2004  -2
State-A   2003  2004  -1
State-A   2004  2004   0
State-A   2005  2004   1
State-A   2006  2004   2
State-A   2007  2004   3
State-A   2008  2004   4
State-A   2009  2004   5
State-B   2000  2004  -5
State-B   2001  2004  -4
State-B   2002  2004  -3
State-B   2003  2004  -2
State-B   2004  2004  -1
State-B   2005  2004   0
State-B   2006  2004   1
State-B   2007  2004   2
State-B   2008  2004   3
State-B   2009  2004   4

My goal is the create several indicator columns based on the values in column Lag and user defined Event window.

If the user defined event window is between 2001 and 2008 then the expected dataset is to be like this

    State   Enter   Event   Lag LagM3_2001  LagM2_2002  LagM1_2003  Lead0_2004  Lead1_2005  Lead2_2006  Lead3_2007  Lead5_2008
   State-A  2000    2004    -4  1           0           0           0           0           0           0           0
   State-A  2001    2004    -3  1           0           0           0           0           0           0           0
   State-A  2002    2004    -2  0           1           0           0           0           0           0           0
   State-A  2003    2004    -1  0           0           1           0           0           0           0           0
   State-A  2004    2004    0   0           0           0           1           0           0           0           0
   State-A  2005    2004    1   0           0           0           0           1           0           0           0
   State-A  2006    2004    2   0           0           0           0           0           1           0           0
   State-A  2007    2004    3   0           0           0           0           0           0           1           0    
   State-A  2008    2004    4   0           0           0           0           0           0           0           1
   State-A  2009    2004    5   0           0           0           0           0           0           0           1
   State-B  2000    2004    -4  1           0           0           0           0           0           0           0
   State-B  2001    2004    -3  1           0           0           0           0           0           0           0
   State-B  2002    2004    -2  0           1           0           0           0           0           0           0
   State-B  2003    2004    -1  0           0           1           0           0           0           0           0
   State-B  2004    2004    0   0           0           0           1           0           0           0           0
   State-B  2005    2004    1   0           0           0           0           1           0           0           0
   State-B  2006    2004    2   0           0           0           0           0           1           0           0
   State-B  2007    2004    3   0           0           0           0           0           0           1           0
   State-B  2008    2004    4   0           0           0           0           0           0           0           1
   State-B  2009    2004    5   0           0           0           0           0           0           0           1

Since the user defined event window is between 2001 and 2008. First generate 2004 -2001 (3 lag columns), 2008 - 2004 (4 lead columns), 1 (Lead0) column .

Start with the lowest column which is LagM3_2001

  • Any row where value in Enter column is less or equal to (2004-3)=2001 is assigned value 1 So based on this logic, values in column LagM3_2001 is 1 when Enter is 2001 or less(2000,1999) etc.

I will jump to the highest column which is Lead5_2008

  • Any row where value in Enter column is greater than or equal to (2004+4)=2008 is assigned value 1 So based on this logic, values in column Lead5_2008 is 1 when Enter is 2008 or higher(2008,2009) etc.

This is the logic for any columns that are in between the lowest(LagM3_2001) and highest (Lead5_2008)

  • Values in Column (LagM2_2002) is 1 if; 2004-2 = 2002 & value in Event column is 2002, 0 everywhere else

  • Values in Column (LagM2_2003) is 1 if; 2004-1 = 2003 & value in Event column is 2001, 0 everywhere else

  • Values in Column (Lead0_2004) is 1 if; 2004-0 = 2004 & value in Event column is 2000, 0 everywhere else

  • Values in Column (Lead1_2005) is 1 if; 2004+1 = 2005 & value in Event column is 2005, 0 everywhere else

  • Values in Column (Lead1_2006) is 1 if; 2004+2 = 2006 & value in Event column is 2006, 0 everywhere else

  • Values in Column (Lead1_2007) is 1 if; 2004+3 = 2007 & value in Event column is 2007, 0 everywhere else

I could do this using lot of ifelse statements, but if there is an efficient way to code this, I would much appreciate it. Thanks in advance.


Solution

  • pacman::p_load(tidyverse, glue)
    
    create_eventdf <- function(df, start, end) {
        df |>
        mutate(Category = case_when(Enter <= start ~ glue("LagM{Event-start}_{start}"),
                                    Enter >= end ~ glue("Lead{Event-Enter+1}_{end}"),
                                    Enter < Event ~ glue("LagM{Event-Enter}_{Enter}"),
                                    TRUE ~ glue("Lead{Event-Enter+1}_{Event}")))|>
                                    pivot_wider( names_from = Category, values_from = Category, values_fn = length, values_fill = 0, unused_fn = NULL)
    
    }
    
    create_eventdf(df, 2001, 2008)
    

    Output:

    # A tibble: 20 × 13
       State   Enter Event   Lag LagM3_2001 LagM2_2002 LagM1_2003 Lead1_2004
       <chr>   <int> <int> <int>      <int>      <int>      <int>      <int>
     1 State-A  2000  2004    -4          1          0          0          0
     2 State-A  2001  2004    -3          1          0          0          0
     3 State-A  2002  2004    -2          0          1          0          0
     4 State-A  2003  2004    -1          0          0          1          0
     5 State-A  2004  2004     0          0          0          0          1
     6 State-A  2005  2004     1          0          0          0          0
     7 State-A  2006  2004     2          0          0          0          0
     8 State-A  2007  2004     3          0          0          0          0
     9 State-A  2008  2004     4          0          0          0          0
    10 State-A  2009  2004     5          0          0          0          0
    11 State-B  2000  2004    -5          1          0          0          0
    12 State-B  2001  2004    -4          1          0          0          0
    13 State-B  2002  2004    -3          0          1          0          0
    14 State-B  2003  2004    -2          0          0          1          0
    15 State-B  2004  2004    -1          0          0          0          1
    16 State-B  2005  2004     0          0          0          0          0
    17 State-B  2006  2004     1          0          0          0          0
    18 State-B  2007  2004     2          0          0          0          0
    19 State-B  2008  2004     3          0          0          0          0
    20 State-B  2009  2004     4          0          0          0          0
    # ℹ 5 more variables: Lead0_2004 <int>, `Lead-1_2004` <int>,
    #   `Lead-2_2004` <int>, `Lead-3_2008` <int>, `Lead-4_2008` <int>