Convert a factor into binary dummies when not all levels are present


I have a number of data frames that each contain a factor I want to expand into binary indicator columns (one-hot encoding). Not every possible level appears in every data frame, but I do know the full set of levels (there are 70). I want every data frame to get a dummy column for every possible level.

With the code below I can create the dummies within each data frame, but only for the levels that are present. For example, set1.df has no person in category "E" or "F", whilst set2.df has no one in category "D". What I need are additional columns set1.dfE and set1.dfF in set1.df that are all zeros, and a column set2.dfD in set2.df that is all zeros. I cannot rbind set1.df and set2.df before creating the dummies, because I need to process each data frame using the binary variables before rbinding. To reiterate, I know beforehand which levels are possible in my data, e.g. "A" to "F".

library(dummies)

person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
set1.df <- data.frame(person_id,person_cat)

person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
set2.df <- data.frame(person_id,person_cat)

dummies1 <- dummy(set1.df[,2])
dummies2 <- dummy(set2.df[,2])

dummies1
dummies2

The expected output is:

> dummies1
      set1.dfA set1.dfB set1.dfC set1.dfD set1.dfE set1.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        1        0        0
 [8,]        1        0        0        0        0        0
 [9,]        1        0        0        0        0        0
[10,]        1        0        0        0        0        0
> dummies2
      set2.dfA set2.dfB set2.dfC set2.dfD set2.dfE set2.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        0        1        0
 [8,]        0        0        0        0        1        0
 [9,]        0        0        0        0        0        1
[10,]        1        0        0        0        0        0

Solution

Create person_cat as a factor with all possible levels declared, then call dummy() with drop=FALSE so that levels with no observations still get a column:

    library(dummies)
    
    person_id <- c(1,2,3,4,5,6,7,8,9,10)
    person_cat <- c("A","B","C","A","B","C","D","A","A","A")
    person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
    set1.df <- data.frame(person_id,person_cat)
    
    person_id <- c(11,12,13,14,15,16,17,18,19,20)
    person_cat <- c("A","B","C","A","B","C","E","E","F","A")
    person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
    set2.df <- data.frame(person_id,person_cat)
    
    dummies1 <- dummy(set1.df[,2],drop=FALSE)
    dummies2 <- dummy(set2.df[,2],drop=FALSE)
    
    dummies1
    dummies2
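
If the dummies package is not available, the same idea can be sketched in base R. This is a minimal sketch, not part of the original answer: one_hot and all_levels are hypothetical names introduced here for illustration, and it assumes every value in the column is one of the known levels. It relies on model.matrix() keeping a column for every declared factor level, including levels with no observations.

```r
# Assumption: all_levels lists every category that can occur in the data.
all_levels <- c("A", "B", "C", "D", "E", "F")

# Hypothetical helper: one-hot encode a vector against a fixed set of levels.
one_hot <- function(x, levels) {
  # Declaring the levels up front keeps unused ones in the factor.
  f <- factor(x, levels = levels)
  # "- 1" drops the intercept so that every level gets its own 0/1 column.
  m <- model.matrix(~ f - 1)
  colnames(m) <- levels
  m
}

set1.df <- data.frame(
  person_id  = 1:10,
  person_cat = c("A", "B", "C", "A", "B", "C", "D", "A", "A", "A")
)

dummies1 <- one_hot(set1.df$person_cat, all_levels)
dummies1
```

Because the level set is passed in explicitly, the same helper can be applied to each data frame separately and will produce the same columns every time. Note that a value outside all_levels becomes NA in the factor, and model.matrix() drops such rows by default, so it is worth checking the row count of the result.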