Search code examples
rdplyrone-hot-encodingwide-format-data

How to turn a hot-encoded variable (multiple columns can be true) to one variable in R?


I am trying to convert columns that are one-hot encoded from a multiple-choice (multiple answers can be true) questionnaire to only one column, without affecting the other variables. I found similar threads that convert a one-hot encoded variable that is binary (only one column is 1, the others are 0).

Example data set:

df <- data.frame(
  `ID` = c(01,02,03),
  `Context_Family` = c(0,1,0),
  `Context_Friends` = c(1,1,0),
  `Context_Spouse` = c(0,1,0),
  `Context_Alone` = c(0,0,1),
  `Disposition_Stress` = c(0,1,0),
  `Disposition_Melancholic` = c(1,1,0),
  Stress = c(20,24,35)
)

What I want is a data frame of the form:

ID | Context | Disposition | Stress

1  |  Friends | Melancholic | 20
2  |  Family, Friends, Spouse | Stress, Melancholic | 24
3  |  Alone | Not stressed nor melancholic | 35

In other words, I want to have one column for Context, one for Disposition. If all values are '0' for the disposition or context for example, I want this to be captured (e.g., ID 3, not stressed nor melancholic), otherwise just put the variables that are '1' (e.g., ID 2, he has family, friends and spouse under 'Context' as he put 1 to those columns in the original data set). The other column 'stress' which is good is not changed.

The code should be adaptable to many more columns that have this format like Context_ and Disposition_, I put here a simplification. In pseudocode it would be nice to have a function or something like that which can basically be:

Give me the column names that are hot-encoded: "Context_", "Disposition_"
For these columns, I will make two columns (in this case) e.g., "Context", "Disposition"
If an observation (row) does not have 1 to any of them, I will simply take all the different column names and say for example for "Context" -> "Not stressed nor melancholic" 

I tried doing this with dplyr but with no success so far.


Solution

  • We can get the data in long format (pivot_longer) bringing Context and Disposition in columns of it's own then for each ID summarise the data in one row by giving the values only for those whose value is 1.

    library(dplyr)
    library(tidyr)
    
    df %>%
      pivot_longer(cols = -c(ID, Stress), names_to = c(".value", "type"),
                   names_sep = "_") %>%
      summarise(across(c(Context, Disposition), ~toString(type[. %in% 1])), 
                      .by = c(ID, Stress))
    
    #     ID Stress Context                 Disposition          
    #  <dbl>  <dbl> <chr>                   <chr>                
    #1     1     20 Friends                 "Melancholic"        
    #2     2     24 Family, Friends, Spouse "Stress, Melancholic"
    #3     3     35 Alone                   ""                   
    

    You may then replace the empty Disposition value ("") to "Not stressed nor melancholic" by adding the following -

    %>%
       mutate(Disposition = replace(Disposition, Disposition == "", 
                             "Not stressed nor melancholic"))