r one-hot-encoding

How can I one-hot-encode multiple columns in R that share categories?


Say I have a dataframe with two columns like this:

Label 1   Label 2
A         B
A         C
B         C
C         A
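
A minimal sketch for reproducing this data, assuming the column names label1 and label2 from the dplyr code further down (the table header shows them as "Label 1" and "Label 2", so adjust the names if yours differ):

```r
# Example data from the table above.
# The names label1/label2 are an assumption taken from the dplyr snippet.
df <- data.frame(
  label1 = c("A", "A", "B", "C"),
  label2 = c("B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Inspect the structure: two character columns, four rows
str(df)
```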

The values A, B, and C in the first column are the same categories as A, B, and C in the second column. I want the encoding to look like this:

Label 1   Label 2   is_A   is_B   is_C
A         B         1      1      0
A         C         1      0      1
B         C         0      1      1
C         A         1      0      1

Basically, I just want it to check if a value shows up in either column. If so, then code a 1, if not then code a 0.

Now, I know I could write this using an if_else, like this:

library(dplyr)

df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
                    is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
                    is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))

but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:

encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)

but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?


Solution

  • In base R you could try:

    cbind(df, unclass(table(row(df), unlist(df))))
    
      Label_1 Label_2 A B C
    1       A       B 1 1 0
    2       A       C 1 0 1
    3       B       C 0 1 1
    4       C       A 1 0 1
    

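    Another way (this relies on substring matching via grepl, so it is only safe when no category name is contained in another):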

    cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))
    
      Label_1 Label_2 A B C
    1       A       B 1 1 0
    2       A       C 1 0 1
    3       B       C 0 1 1
    4       C       A 1 0 1
    

    Note that for the table approach you should use:

    +unclass(table(row(df), unlist(df))>0)
    

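    This handles rows where the same value appears in more than one column: table() would count it as 2, and comparing with 0 turns every count back into a 0/1 indicator.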

    If you want to use model.matrix:

    +Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))

      valuesA valuesB valuesC
    1       1       1       0
    2       1       0       1
    3       0       1       1
    4       1       0       1
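
As a rough end-to-end sketch of the table() approach above, the snippet below adds a hypothetical fifth row (B, B) to show why the `> 0` comparison matters, and renames the indicator columns with an `is_` prefix to match the question. The extra row and the names `counts`, `flags`, and `result` are illustrative assumptions, not part of the original answer.

```r
# Example data; row 5 is a hypothetical addition with the same value in both columns
df <- data.frame(
  Label_1 = c("A", "A", "B", "C", "B"),
  Label_2 = c("B", "C", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Raw counts per row and category: row 5 gets a 2 under B because "B" appears twice
counts <- unclass(table(row(df), unlist(df)))

# Presence/absence: compare with 0, then coerce logical to integer with unary +
flags <- +(counts > 0)

# Prefix the new columns to match the is_A / is_B / is_C names from the question
colnames(flags) <- paste0("is_", colnames(flags))

result <- cbind(df, flags)
print(result)
```

Because the categories come from `unlist(df)`, this should scale to 50+ categories without listing them by hand, as long as every label column is included in `df` when the table is built.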