
How to one-hot encode multiple features that each have multiple labels


My data has several categorical features, and each record can have multiple labels per feature (multilabel), with the labels spread over multiple rows.

myDf <- data.frame(myGroup = c("A", "B", "B", "C", "C", "C"),
                   myFruit = as.factor(c("apple", "apple", "banana", "apple", "lime", "lemon")),
                   myCode = as.factor(c("AAA", "AAA", "CCC", "AAA", "BBB", "CCC")))
myDf
myGroup myFruit myCode
      A   apple    AAA
      B   apple    AAA
      B  banana    CCC
      C   apple    AAA
      C    lime    BBB
      C   lemon    CCC

The expected output would look like:

myGroup apple banana lemon  lime   AAA   BBB   CCC
A           1      0     0     0     1     0     0
B           1      1     0     0     1     0     1
C           1      0     1     1     1     1     1

How can I one-hot encode this multi label data?

I am including a self-answer, however I suspect there is a better way to do this.

For example, if there are 20 fields in need of encoding, should I repeat the mutate/spread step 20 times?


Solution

  • Building on great answers in R like: One Hot Encoding From Multiple Rows in R

    and in Python like: How could I do one hot encoding with multiple values in one cell?

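    library(dplyr)
    library(tidyr)

    myDf %>%
      mutate(n = 1) %>%                              # indicator value for the fruit columns
      spread(myFruit, n, fill = 0, sep = "_") %>%    # one 0/1 column per fruit
      mutate(n = 1) %>%                              # indicator value for the code columns
      spread(myCode, n, fill = 0, sep = "_") %>%     # one 0/1 column per code
      group_by(myGroup) %>%
      summarise(across(.cols = everything(), max))   # collapse to one row per group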
    
    myGroup myFruit_apple myFruit_banana myFruit_lemon myFruit_lime myCode_AAA myCode_BBB myCode_CCC
    A                   1              0             0            0          1          0          0
    B                   1              1             0            0          1          0          1
    C                   1              0             1            1          1          1          1
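
    The mutate/spread pair above has to be repeated once per feature. One possible way to avoid that, sketched here under the assumption that every column other than myGroup is a feature to encode, is to reshape to long form first and then widen once with tidyr's pivot_longer and pivot_wider. The names feature, label, present and encoded are made up for this illustration, and the resulting column order differs from the output above.

    ```r
    library(dplyr)
    library(tidyr)

    # Same example data as in the question
    myDf <- data.frame(myGroup = c("A", "B", "B", "C", "C", "C"),
                       myFruit = as.factor(c("apple", "apple", "banana", "apple", "lime", "lemon")),
                       myCode = as.factor(c("AAA", "AAA", "CCC", "AAA", "BBB", "CCC")))

    encoded <- myDf %>%
      # factors with different levels are converted to character so they can share one column
      mutate(across(-myGroup, as.character)) %>%
      # one row per (group, feature, label) combination
      pivot_longer(-myGroup, names_to = "feature", values_to = "label") %>%
      # drop repeats so each combination maps to exactly one cell
      distinct() %>%
      mutate(present = 1) %>%
      # one 0/1 column per feature_label pair, 0 where the label never occurs
      pivot_wider(names_from = c(feature, label),
                  values_from = present,
                  values_fill = 0)

    encoded
    ```

    Because the reshape handles all feature columns at once, the same pipeline should apply unchanged to 20 fields.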
    

    ..............

    EDIT:

    Thanks to the answer at https://stackoverflow.com/a/52911170/10276092, I discovered the mltools::one_hot function, which automatically one-hot encodes every unordered factor column. Combined with a group-and-summarise step, it produces the expected result:

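    library(mltools)
    library(data.table)
    library(dplyr)

    # one_hot() expands every unordered factor column into 0/1 columns
    one_hot(as.data.table(myDf)) %>%
      group_by(myGroup) %>%
      # take the maximum per group, i.e. 1 if the label occurs in any of its rows
      summarise(across(.cols = everything(), max))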
         
      myGroup myFruit_apple myFruit_banana myFruit_lemon myFruit_lime myCode_AAA myCode_BBB myCode_CCC
    1 A                   1              0             0            0          1          0          0
    2 B                   1              1             0            0          1          0          1
    3 C                   1              0             1            1          1          1          1
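
    If extra packages are not an option, the same idea can be sketched in base R: cross-tabulate the group against each feature with table() and reduce the counts to 0/1. This is only an illustration, again assuming that every column other than myGroup should be encoded; featureCols, blocks and encodedBase are names invented for the sketch.

    ```r
    # Same example data as in the question
    myDf <- data.frame(myGroup = c("A", "B", "B", "C", "C", "C"),
                       myFruit = as.factor(c("apple", "apple", "banana", "apple", "lime", "lemon")),
                       myCode = as.factor(c("AAA", "AAA", "CCC", "AAA", "BBB", "CCC")))

    # Assumption: everything except the grouping column is a feature
    featureCols <- setdiff(names(myDf), "myGroup")

    # For each feature, count group x label occurrences and turn counts into 0/1
    blocks <- lapply(featureCols, function(col) {
      counts <- table(myDf$myGroup, myDf[[col]])
      matrix(as.integer(counts > 0),
             nrow = nrow(counts),
             dimnames = list(rownames(counts),
                             paste(col, colnames(counts), sep = "_")))
    })

    # Bind the per-feature blocks side by side and restore the group column
    encodedBase <- data.frame(myGroup = rownames(blocks[[1]]),
                              do.call(cbind, blocks),
                              row.names = NULL)

    encodedBase
    ```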