Building dummy variable with many conditions (R)


My dataset looks something like this

ID  YOB   ATT94  GRADE94  ATT96  GRADE96  ATT98  .....
1   1975  1      12       0      NA
2   1985  1      3        1      5
3   1977  0      NA       0      NA
4   ......

(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)

I'm trying to create a dummy variable that equals 1 if an individual is attending school when they are 19 or 20 years old, e.g. if YOB = 1978 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!), so every piece of code I write produces an error.

Any help would be appreciated, thanks.

Edit:

So, I've just noticed that something has gone wrong. I changed your code a bit, just to add another column to the long-format data table. Here is what I did in the end:

df %>%
  melt(id = c("ID", "DOB")) %>%
  tbl_df() %>%
  mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))

so it looks something like e.g.

    ID  YOB   VARIABLE  VALUE  dummy
    1   1979  ATT94     1994   1
    1   1979  ATT96     1996   1
    1   1979  ATT98     0      0 
    2   1976  ATT94     0      0
    2   1976  ATT96     1996   1 
    2   1976  ATT98     1998   1

i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?


Solution

  • Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.

    Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately, the melt function in the reshape2 package converts wide data to long. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
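
Before the full example, here is a minimal sketch of what wide versus long means for this kind of data. The data frame `wide` and the column names `survey` and `attended` are made up for illustration, using the first three rows of the question's table.

```r
library(reshape2)

# Hypothetical wide data: one row per person, one ATT column per survey year
wide <- data.frame(
  ID    = 1:3,
  YOB   = c(1975, 1985, 1977),
  ATT94 = c(1, 1, 0),
  ATT96 = c(0, 1, 0)
)

# Long data: one row per person per survey year.
# ID and YOB are kept as identifiers; every other column is stacked.
long <- melt(wide, id.vars = c("ID", "YOB"),
             variable.name = "survey", value.name = "attended")

long
```

Each person now appears once per ATT column, so a single condition on the stacked column covers every survey year at once.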

    # Load libraries 
    library(dplyr)
    library(reshape2)
    
    # Create a sample dataset
    ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
    ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
    ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
    DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
    df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
    
    # Recode ATTx variables with the actual year
    df$ATT94[df$ATT94==1] <- 1994
    df$ATT96[df$ATT96==1] <- 1996
    df$ATT98[df$ATT98==1] <- 1998
    
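    # Melt the data into a long format and perform requested analysis
    df %>%
      melt(id = "DOB") %>%
      as_tibble() %>%  # tbl_df() is deprecated in current dplyr
      # Parentheses are required: %in% binds tighter than -, so without them
      # this is parsed as value - (DOB %in% c(19, 20))
      mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))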
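
The behaviour described in the question's edit (dummy = 1 whenever the ATT value is non-zero) most likely comes from operator precedence: in R, `%in%` binds more tightly than binary `-`, so `value - DOB %in% c(19, 20)` is parsed as `value - (DOB %in% c(19, 20))`. No birth year equals 19 or 20, so the expression reduces to `value`, and `ifelse` treats any non-zero number as TRUE. The sketch below reproduces this on a small made-up data frame that mirrors the table in the edit, and then collapses the result to one row per person. The names `buggy`, `fixed`, `per_person` and `attended_19_20` are illustrative assumptions, not part of the original code.

```r
library(dplyr)
library(reshape2)

# Hypothetical data mirroring the table in the question's edit:
# ATT columns hold the survey year when attending, 0 otherwise
df <- data.frame(
  ID    = c(1, 2),
  YOB   = c(1979, 1976),
  ATT94 = c(1994, 0),
  ATT96 = c(1996, 1996),
  ATT98 = c(0, 1998)
)

long <- melt(df, id.vars = c("ID", "YOB"))

# Without parentheses: parsed as value - (YOB %in% c(19, 20)), i.e. just value,
# so every non-zero value yields 1
buggy <- ifelse(long$value - long$YOB %in% c(19, 20), 1, 0)

# With parentheses: the age is computed first, then tested against 19 and 20
fixed <- ifelse((long$value - long$YOB) %in% c(19, 20), 1, 0)

# One row per person: 1 if they attended in any survey year at age 19 or 20
per_person <- long %>%
  mutate(dummy = as.integer((value - YOB) %in% c(19, 20))) %>%
  group_by(ID) %>%
  summarise(attended_19_20 = max(dummy))

per_person
```

With these values, only person 2 in 1996 (age 20) satisfies the corrected condition. Taking the maximum per ID is one way to get back to a single dummy per individual, which is what the original question asked for.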