
One-hot-encoding an R list of characters


I have the following R data frame:

id    color
001   blue
001   yellow
001   red
002   blue
003   blue
003   yellow

What's the general method to one-hot-encode such a data frame into the following:

id    blue    yellow    red
001   1       1         1
002   1       0         0
003   1       1         0

Thank you very much.


Solution

  • Try this. Add an indicator column set to 1 for every (id, color) pair present in the data, then use pivot_wider() to spread the colors into columns. Combinations that do not occur in the data come out as NA, which you can replace with zero using replace(). Here is the code, using tidyverse functions:

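    library(dplyr)
    library(tidyr)
    # Flag each observed (id, color) pair with 1, spread the colors into columns,
    # then turn the NA left for unobserved pairs into 0
    dfnew <- df %>% mutate(val = 1) %>%
      pivot_wider(names_from = color, values_from = val) %>%
      replace(is.na(.), 0)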
    

    Output:

    # A tibble: 3 x 4
         id  blue yellow   red
      <int> <dbl>  <dbl> <dbl>
    1     1     1      1     1
    2     2     1      0     0
    3     3     1      1     0
    

    Sample data used:

    #Data
    df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L), color = c("blue", 
    "yellow", "red", "blue", "blue", "yellow")), class = "data.frame", row.names = c(NA,-6L))
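
For comparison, here is a minimal sketch of two possible variations on the approach above; neither comes from the original answer. The first assumes a reasonably recent tidyr in which pivot_wider() accepts a scalar values_fill, so the separate replace() step is not needed. The second uses only base R, cross-tabulating id against color with table(). The names wide, tab and onehot are made up for illustration.

```r
library(tidyr)

# Same sample data as in the answer above
df <- data.frame(
  id = c(1L, 1L, 1L, 2L, 3L, 3L),
  color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)

# Variant 1 (assumes a recent tidyr): fill unobserved (id, color) pairs
# with 0 directly inside pivot_wider() instead of calling replace() afterwards
wide <- pivot_wider(
  transform(df, val = 1),
  names_from = color,
  values_from = val,
  values_fill = 0
)

# Variant 2 (base R only): count occurrences of each (id, color) pair,
# then reduce the counts to 0/1 indicators.
# Columns come out in alphabetical order and the ids become row names.
tab <- table(df$id, df$color)
onehot <- as.data.frame((unclass(tab) > 0) * 1L)
```

The base R variant also copes with duplicated (id, color) rows, since any positive count is reduced to 1, whereas pivot_wider() would warn about values that are not uniquely identified in that case.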