How to one-hot encode stacked columns in R


I have data that looks like this:

+---+-------+
|   |  col1 |
+---+-------+
| 1 |     A |
| 2 |   A,B |
| 3 |   B,C |
| 4 |     B |
| 5 | A,B,C |
+---+-------+

Expected output:

+---+---+---+---+
|   | A | B | C |
+---+---+---+---+
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
+---+---+---+---+

How can I encode it like this?


Solution

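  • This might help. It uses dplyr and tidyr, and assumes col1 is a list column (see Data below):

    library(dplyr)
    library(tidyr)

    df %>%
      mutate(r = 1:n()) %>%  # record the original row number
      unnest(col1) %>%       # one row per (value, row number) pair
      table() %>%            # cross-tabulate col1 against r
      t()                    # transpose so rows are r and columns are col1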
    

    which gives

       col1
    r   A B C
      1 1 0 0
      2 1 1 0
      3 0 1 1
      4 0 1 0
      5 1 1 1
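
    As a rough base R sketch of what the pipeline above does, each row number can be repeated once per value it holds and the resulting pairs cross-tabulated. The names `r` and `tab` are illustrative only, and the data is kept as a plain list rather than a tibble:

    ```r
    # Same values as the list column in the Data section, as a plain list
    col1 <- list("A", c("A", "B"), c("B", "C"), "B", c("A", "B", "C"))

    # Repeat each row index once per value it holds (the effect unnest() has on r)
    r <- rep(seq_along(col1), lengths(col1))

    # Cross-tabulate the row index against the flattened values
    tab <- table(r = r, col1 = unlist(col1))
    tab
    ```

    Because the row index is passed first, no transpose is needed here.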
    

    Data

    df <- tibble(
      col1 = list(
        "A",
        c("A", "B"),
        c("B", "C"),
        "B",
        c("A", "B", "C")
      )
    )
    

    If your data is given in the following format

    df <- data.frame(
      col1 = c("A", "A,B", "B,C", "B", "A,B,C")
    )
    

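    then you can try the following in base R (col1 must be character, which is the data.frame() default since R 4.0.0; wrap it in as.character() if it is a factor):

    with(
      df,
      # split each string, name the pieces by row number, stack them into
      # (values, ind), then reverse the columns so table() puts ind on the rows
      table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
    )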
    

    which gives

       values
    ind A B C
      1 1 0 0
      2 1 1 0
      3 0 1 1
      4 0 1 0
      5 1 1 1
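
    Note that table() counts occurrences, so a repeated value such as "A,A" would show up as 2. If a plain data frame with strict 0/1 columns is wanted, as in the expected output, one possible sketch uses %in% instead. The names `parts`, `lvls`, `onehot`, and `result` are made up for illustration:

    ```r
    df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))

    # Split each string on commas; fixed = TRUE treats "," literally
    parts <- strsplit(as.character(df$col1), ",", fixed = TRUE)

    # All distinct values, sorted, become the output columns
    lvls <- sort(unique(unlist(parts)))

    # One row per input row: 1 if the value is present, 0 otherwise
    onehot <- do.call(rbind, lapply(parts, function(p) as.integer(lvls %in% p)))
    colnames(onehot) <- lvls

    result <- as.data.frame(onehot)
    result
    ```

    Values with surrounding spaces (for example "A, B") would need trimws() applied to the split pieces first.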