Search code examples
rdummy-variable

Generate a dummy matrix from multiple factor columns


I already searched the web and found no answer. I have a big data.frame that contains multiple columns. Each column is a factor variable.

I want to transform the data.frame such that each possible value of the factor variables is a variable that either contains a "1" if the variable is present in the factor column or "0" otherwise.

Here is an example of what I mean.

labels <- c("1", "2", "3", "4", "5", "6", "7") 

#create data frame (note, not all factor levels have to be in the columns,
#NA values are possible)
input <- data.frame(ID = c(1, 2, 3), 
Cat1 = factor(c( 4, 1, 1), levels = labels), 
Cat2 = factor(c(2, NA, 4), levels = labels),
Cat3 = factor(c(7, NA, NA), levels = labels))

#the seven factor levels now are the variables of the data.frame
desired_output <- data.frame(ID = c(1, 2, 3),
Dummy1 = c(0, 1, 1),
Dummy2 = c(1, 0, 0),
Dummy3 = c(0, 0, 0),
Dummy4 = c(1, 0, 1),
Dummy5 = c(0, 0, 0),
Dummy6 = c(0, 0, 0),
Dummy7 = c(1, 0, 0))

input
ID Cat1 Cat2 Cat3
1    4    2    7
2    1 <NA> <NA>
3    1    4 <NA>

desired_output
ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
1      0      1      0      1      0      0      1
2      1      0      0      0      0      0      0
3      1      0      0      1      0      0      0

My actual data.frame has over 3000 rows and factors with more than 100 levels. I hope you can help me converting the input to the desired output.

Greetings sush


Solution

  • A couple of methods, that riff off of Gregor's and Aaron's answers.

    From Aaron's. factorsAsStrings=FALSE keeps the factor variables hence all labes when using dcast

    library(reshape2)
    dcast(melt(input, id="ID", factorsAsStrings=FALSE), ID ~ value, drop=FALSE) 
      ID 1 2 3 4 5 6 7 NA
    1  1 0 1 0 1 0 0 1  0
    2  2 1 0 0 0 0 0 0  2
    3  3 1 0 0 1 0 0 0  1
    

    Then you just need to remove the last column.

    From Gregor's

    na.replace <- function(x) replace(x, is.na(x), 0)
    options(na.action='na.pass') # this keeps the NA's which are then converted to zero
    Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
      x1 x2 x3 x4 x5 x6 x7
    1  0  1  0  1  0  0  1
    2  1  0  0  0  0  0  0
    3  1  0  0  1  0  0  0
    

    Then you just need to cbind the ID column