Search code examples
rdataframepackagedata-cleaning

One hot coding a data-frame in R


Consider a data-frame df1 similar to the one shown

ID EDUCATION   OCCUPATION      BINARY_VAR
1  Undergrad   Student              1
2  Grad        Business Owner       1
3  Undergrad   Unemployed           0
4  PhD         Other                1

You may create your own random df1 using R-code below

ID <- c(1:4)
EDUCATION <- sample (c('Undergrad', 'Grad', 'PhD'), 4, rep = TRUE)
OCCUPATION <- sample (c('Student', 'Business Owner', 'Unemployed', 'Other'), 4, rep = FALSE)
BINARY_VAR <- sample(c(0, 1), 4, rep = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

From this, I need to derive another data-frame df2 that would look like this

ID Undergrad Grad PhD Student Business Owner Unemployed Other BINARY_VAR
1      1      0    0     1           0           0        0       1
2      1      1    0     0           1           0        0       1
3      1      0    0     0           0           1        0       0
4      1      1    1     0           0           0        1       1

You must have noticed how for level PhD, the other factor levels under EDUCATION also hold true since EDUCATION is the highest education level for that ID. That, however, is the secondary objective.

I can't seem to figure out a way to obtain a data-frame with each column giving the truth value corresponding to individual factor levels in its parent data-frame. Is there a package in R that could help? Or maybe a way to code this?

Can I do this using melt?

I read through previously asked question(s) that looked similar, but they deal with frequencies of occurrence.


Edit:

As recommended by Sumedh, one way to do this is using dummyVars from caret.

dummies <- dummyVars(ID ~ ., data = df1)
df2 <- data.frame(predict(dummies, newdata = df1))
df2 <- df2 [1:7]

Solution

  • tidyr and dplyr combined with that base table() function should work:

    ID <- c(1:4)
    EDUCATION <- c('Undergrad', 'Grad', 'PhD', 'Undergrad')
    OCCUPATION <- c('Student', 'Business Owner', 'Unemployed', 'Other')
    BINARY_VAR <- sample(c(0, 1), 4, rep = TRUE)
    df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)
    
    # Convert to factor
    df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)
    
    library(dplyr)
    library(tidyr)
    
    edu<-as.data.frame(table(df1[,1:2])) %>% spread(EDUCATION, Freq)
    
    for(i in 1:nrow(edu))
      if(edu[i,]$PhD == 1) 
        edu[i,]$Undergrad <-edu[i,]$Grad <-1
    
    truth_table<-merge(edu,
          as.data.frame(table(df1[,c(1,3)])) %>% spread(OCCUPATION, Freq),
          by = "ID")
    
    truth_table$BINARY_VAR<-df1$BINARY_VAR
    truth_table
    
    ID Grad PhD Undergrad Business Owner Other Student Unemployed BINARY_VAR
    1    0   0         1              0     0       1          0          1
    2    1   0         0              1     0       0          0          1
    3    1   1         1              0     0       0          1          0
    4    0   0         1              0     1       0          0          1
    

    Edit: added a for loop to update the education levels beneath PhD inspired by @ Sumedh's earlier suggestion.