
Creating a New Race Variable from Existing Column in Data Frame in R (with case_when function)


I am working with data from the National Health Interview Survey and trying to simplify the race variable into 5 buckets. I want to create a new column titled "RACE" from existing data, coded as Asian = 1, Black = 2, White (non-Hispanic) = 3, Hispanic = 4, Other = 5. Currently, the race variable is titled "RACEA" and uses the following codes:

411, 412, 416, 434 = Asian
200 = Black
100 = White
310, 580, 600 = Other

However, Hispanic ethnicity is recorded in a separate variable titled HISPETH. With this variable,

10 = non-Hispanic
20, 23, 30, 40, 50, 61, 62, 63, 70 = Hispanic

Therefore, to create the White (non-Hispanic) and Hispanic values, I need R to use both the RACEA and HISPETH columns.

Here is the code I attempted to run to do all this, but I was met with the message "longer object length is not a multiple of shorter object length" for the portion with the list of HISPETH values, as shown below.

What should I do? I am open to using other functions besides case_when, this is just what I've used in the past. Thanks!

`NHIS_test <- NHIS1 %>% 
      mutate(RACE = case_when(RACEA <= 411 ~ '1', 
                              RACEA <= 412 ~ '1', 
                              RACEA <= 416 ~ '1', 
                              RACEA <= 434 ~ '1', 
                              RACEA <= 200 ~ '2',
                              RACEA <= 100 & HISPETH <= 10 ~ '3',
                              HISPETH <= c(20:70) ~ '4', 
                              RACEA<=100 & HISPETH <= c(20,23,30,40,50,61,62,63,70) ~ '4', 
                              RACEA <= 310 ~ '5', 
                              RACEA <= 580 ~ '5',
                              RACEA <= 600 ~ '5',
                              TRUE ~ 'NA'))`

Solution

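  • To compare against a single value, use ==; to test whether a value is one of several values, use %in%. The <= operator is not an equality test: RACEA <= 411 is also true for 100, 200 and 310.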

    library(dplyr)
    
    NHIS_test <- NHIS1 %>% 
                    mutate(RACE = case_when(
                      RACEA %in% c(411, 412, 416, 434) ~ 1, 
                      RACEA == 200 ~ 2, 
                      RACEA == 100 & HISPETH == 10 ~ 3,
                      RACEA == 100 & HISPETH %in% c(20,23,30,40,50,61,62,63,70) ~ 4, 
                      RACEA %in% c(310, 580, 600) ~ 5))
    

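    If none of the above conditions is satisfied, case_when returns NA by default, so no explicit TRUE ~ 'NA' branch is needed (that branch would produce the string "NA", not a missing value).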
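
To see why the original attempt misbehaved, here is a minimal base R sketch on toy vectors. The names `racea`, `hispeth` and `hispanic_codes` are made up for illustration and are not part of the NHIS data; the codes are the ones listed in the question.

```r
# Toy values standing in for the RACEA and HISPETH columns (hypothetical data)
racea   <- c(100, 200, 411)
hispeth <- c(10, 20, 23, 61, 70)
hispanic_codes <- c(20, 23, 30, 40, 50, 61, 62, 63, 70)

# <= is an ordering test, so every code at or below 411 matches,
# including 100 (White) and 200 (Black)
racea <= 411
# [1] TRUE TRUE TRUE

# == compares one value against each element
racea == 200
# [1] FALSE  TRUE FALSE

# %in% tests each element for membership in the set,
# returning one logical per element of the left-hand side
is_hispanic <- hispeth %in% hispanic_codes
is_hispanic
# [1] FALSE  TRUE  TRUE  TRUE  TRUE

# Comparing two vectors of different lengths with ==, <=, etc. recycles
# the shorter one element by element; when the lengths do not divide
# evenly, R warns. Here the warning is captured as a string.
msg <- tryCatch(
  hispeth == hispanic_codes,
  warning = function(w) conditionMessage(w)
)
msg
# [1] "longer object length is not a multiple of shorter object length"
```

This is the same recycling that `HISPETH <= c(20:70)` triggered in the question: the 51-element vector was compared position by position against the column instead of being treated as a set of allowed codes.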