Search code examples
rlinear-regressiondummy-variable

How to create dummy variables using various columns sharing same levels


I am trying to get the dummy variables for the following table:

df:

Value1       var1       var2      var3      var4
9.330154398  HomeATL    AwayHOU   HomeEast  AwayWest
32.43881489  AwaySDN    HomeATL   HomeWest  AwayWest
54.77178387  AwayLAN    HomeATL   AwayEast  HomeSame
54.77178387  AwayLAN    HomeATL   AwayWest  HomeEast

The column var1 and var2 share the same levels. On the other hand, the column var3 and var4 whare their levels as well. Therefore, I need that during the creation of the dummy variables, the new columns created should not have repeated levels. I mean, in the example of var3 and var4, for the 1st and 3rd row, both have AwayWest, so I only need to fill 1 column named AwayWest with a number 1 on each row.

My desired output is:

Value1  HomeEast    HomeWest    AwayEast    AwayWest    HomeSame    HomeATL AwayHOU AwaySDN AwayLAN
9.330154398 1   0   0   1   0   1   1   0   0
32.43881489 0   1   0   1   0   1   0   1   0
54.77178387 0   0   1   0   1   1   0   0   1
54.77178387 1   0   0   1   0   1   0   0   1

I tried creating a new column of 1 (col1) for each column to be transformed:

spread(df,var1, col1) %>%
spread(var2, col1)%>%
spread(var3, col1)%>%
spread(var1, col1)

However it is not working.

Thanks


Solution

  • A base R option would be table

    tbl <- +(table(c(col(df1[-1])), unlist(df1[-1]) ) > 0)
    cbind(df1[1], as.data.frame.matrix(tbl))
    

    Or using tidyverse

    library(tidyverse)
    rownames_to_column(df1, 'rn') %>%
        gather(key, val, var1:var4) %>% 
        count(rn, val) %>%
        spread(val, n, fill = 0)  %>%
        select(-rn) %>%
        bind_cols(df1[1], .)
    

    data

    df1 <- structure(list(Value1 = c(9.330154398, 32.43881489, 54.77178387, 
    54.77178387), var1 = c("HomeATL", "AwaySDN", "AwayLAN", "AwayLAN"
    ), var2 = c("AwayHOU", "HomeATL", "HomeATL", "HomeATL"), var3 = c("HomeEast", 
    "HomeWest", "AwayEast", "AwayWest"), var4 = c("AwayWest", "AwayWest", 
    "HomeSame", "HomeEast")), class = "data.frame", row.names = c(NA, 
    -4L))