My data has several categorical features, and each record can carry multiple labels (multilabel) spread over multiple rows:
myDf <- data.frame(myGroup = c("A", "B", "B", "C", "C", "C"),
                   myFruit = as.factor(c("apple", "apple", "banana", "apple", "lime", "lemon")),
                   myCode = as.factor(c("AAA", "AAA", "CCC", "AAA", "BBB", "CCC")))
myDf
myGroup  myFruit  myCode
A        apple    AAA
B        apple    AAA
B        banana   CCC
C        apple    AAA
C        lime     BBB
C        lemon    CCC
The expected output would look like:
myGroup  apple  banana  lemon  lime  AAA  BBB  CCC
A        1      0       0      0     1    0    0
B        1      1       0      0     1    0    1
C        1      0       1      1     1    1    1
How can I one-hot encode this multi label data?
I am including a self-answer, but I suspect there is a better way to do this.
For example, if there are 20 fields that need encoding, do I have to repeat the mutate/spread step 20 times?
Building on great answers in R (like: One Hot Encoding From Multiple Rows in R)
and in Python (like: How could I do one hot encoding with multiple values in one cell?), here is my attempt:
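library(dplyr)
library(tidyr)

myDf %>%
  mutate(n = 1) %>%                               # indicator value for the fruit columns
  spread(myFruit, n, fill = 0, sep = "_") %>%     # one 0/1 column per fruit
  mutate(n = 1) %>%                               # spread() consumed n, so recreate it
  spread(myCode, n, fill = 0, sep = "_") %>%      # one 0/1 column per code
  group_by(myGroup) %>%
  summarise(across(.cols = everything(), max))    # collapse to one row per group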
myGroup  myFruit_apple  myFruit_banana  myFruit_lemon  myFruit_lime  myCode_AAA  myCode_BBB  myCode_CCC
A        1              0               0              0             1           0           0
B        1              1               0              0             1           0           1
C        1              0               1              1             1           1           1
EDIT:
Thanks to the answer at https://stackoverflow.com/a/52911170/10276092 I discovered the mltools::one_hot function, which automatically encodes every unordered factor column row by row. Combined with a group-and-summarise step, it produces the expected result:
library(dplyr)
library(mltools)
library(data.table)

one_hot(as.data.table(myDf)) %>%    # myGroup passes through as long as it is character, not factor
  group_by(myGroup) %>%
  summarise(across(.cols = everything(), max))
  myGroup  myFruit_apple  myFruit_banana  myFruit_lemon  myFruit_lime  myCode_AAA  myCode_BBB  myCode_CCC
1 A        1              0               0              0             1           0           0
2 B        1              1               0              0             1           0           1
3 C        1              0               1              1             1           1           1
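On the "20 fields" worry: a possible alternative that stays within dplyr/tidyr is to reshape every categorical column to long form once and then widen again, so the number of fields no longer matters. This is only a sketch, and it assumes that every column except myGroup should be encoded. The names feature, label and flag are helper columns made up for this illustration.

```r
library(dplyr)
library(tidyr)

myDf <- data.frame(
  myGroup = c("A", "B", "B", "C", "C", "C"),
  myFruit = as.factor(c("apple", "apple", "banana", "apple", "lime", "lemon")),
  myCode  = as.factor(c("AAA", "AAA", "CCC", "AAA", "BBB", "CCC"))
)

encoded <- myDf %>%
  # factors with different level sets are easier to stack as plain strings
  mutate(across(-myGroup, as.character)) %>%
  # one row per (group, feature, label); "feature" and "label" are hypothetical names
  pivot_longer(-myGroup, names_to = "feature", values_to = "label") %>%
  # drop repeated (group, feature, label) rows so each output cell gets one value
  distinct() %>%
  mutate(flag = 1) %>%
  # one 0/1 column per feature_label combination, 0 where the label is absent
  pivot_wider(names_from = c(feature, label),
              values_from = flag,
              values_fill = 0)

encoded
```

The columns come out in order of first appearance rather than grouped by field, so a select() afterwards may be needed if the column order matters. The distinct() step matters once a group repeats a label: without it, pivot_wider would see several values for one cell.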