Search code examples
rdynamicrefactoringtidyrdummy-variable

Dynamically Create Variables Based on Binary Indicators in R


I have user-level data that looks like this:

ID  V1 V2 V3 V4
001 1  0  1  0
002 0  1  0  1
003 0  0  0  0
004 1  1  1  0

In the above example, I would like an elegant solution (likely using tidyr) to dynamically refactor this to appear as:

ID  Num_Vars Var1 Var2 Var3
001 2        V1   V3   NA
002 2        V2   V4   NA
003 0        NA   NA   NA
004 3        V1   V2   V3

Note that this example is simplified and there are actually many variables. The point is to have code that detects how many variables should be created, based on the maximum number of 1s in Var1-VarX that are populated for any user.


Solution

  • This feels like some fairly standard reshaping: convert to long, manipulate by group, convert back to wide:

    df %>%
        gather(key = var, value = value, -ID) %>%
        group_by(ID) %>%
        filter(value != 0) %>%
        mutate(Num_Vars = n(),
               Var_Label = paste0("Var", 1:n())) %>%
        spread(key = Var_Label, value = var) %>%
        select(-value) %>%
        full_join(distinct(df, ID))
    # Source: local data frame [4 x 5]
    # Groups: ID [?]
    # 
    #      ID Num_Vars  Var1  Var2  Var3
    #   <int>    <int> <chr> <chr> <chr>
    # 1     1        2    V1    V3  <NA>
    # 2     2        2    V2    V4  <NA>
    # 3     4        3    V1    V2    V3
    # 4     3       NA  <NA>  <NA>  <NA>
    

    Using this data reproducibly shared with dput():

    df = structure(list(ID = 1:4, V1 = c(1L, 0L, 0L, 1L), V2 = c(0L, 1L, 
    0L, 1L), V3 = c(1L, 0L, 0L, 1L), V4 = c(0L, 1L, 0L, 0L)), .Names = c("ID", 
    "V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA, 
    -4L))