Search code examples
rdataframedplyrone-hot-encodinglong-format-data

Transforming multiple hot-encoded columns and converting to long-format


I have a quite complex data frame structure:


ID = c(1,2,3)
Sessions = c("2023-11-14 19:01:39+01:00", "2023-11-14 20:01:39+01:00", "2023-11-14 21:01:39+01:00")

P_affect = c(10,20,30) 
N_affect = c(15,30,40)
NMeals = c(0,1,2)

Meal1_Where_Home = c(NA, 1, 0)
Meal1_Where_Restaurant = c(NA, 0, 1)
Meal1_Who_Alone =  c(NA, 1, 0)
Meal1_Who_Friends = c(NA, 0 , 1 )
Meal1_Type_Big_Meal = c(NA, 1, 1)
Meal1_Type_Small_Meal = c(NA, 0, 0)

Meal2_Where_Home = c(NA, NA, 1)
Meal2_Where_Restaurant = c(NA, NA, 0)
Meal2_Who_Alone =  c(NA, NA, 1)
Meal2_Who_Friends = c(NA, NA , 0 )
Meal2_Type_Big_Meal = c(NA, NA, 1)
Meal2_Type_Small_Meal = c(NA, NA, 0)

Meal3_Where_Home = c(NA, NA, NA)
Meal3_Where_Restaurant = c(NA, NA, NA)
Meal3_Who_Alone =  c(NA, NA, NA)
Meal3_Who_Friends = c(NA, NA , NA )
Meal3_Type_Big_Meal = c(NA, NA, NA)
Meal3_Type_Small_Meal = c(NA, NA, NA)


# Create a data frame
df1 <- data.frame(ID, Sessions, P_affect, N_affect, NMeals, Meal1_Where_Home, Meal1_Where_Restaurant,
                            Meal1_Who_Alone, Meal1_Who_Friends, Meal1_Type_Big_Meal, Meal1_Type_Small_Meal,
                            Meal2_Where_Home, Meal2_Where_Restaurant, Meal2_Who_Alone, Meal2_Who_Friends,
                            Meal2_Type_Big_Meal, Meal2_Type_Small_Meal, Meal3_Where_Home, Meal3_Where_Restaurant,
                            Meal3_Who_Alone, Meal3_Who_Friends, Meal3_Type_Big_Meal, Meal3_Type_Small_Meal)


df2 <- data.frame(
  `ID` = c(1,2,3),
  `Context_Family` = c(0,1,0),
  `Context_Friends` = c(1,1,0),
  `Context_Spouse` = c(0,1,0),
  `Context_Alone` = c(0,0,1),
  `Disposition_Stress` = c(0,1,0),
  `Disposition_Melancholic` = c(1,1,0),
  Stress = c(20,24,35)
)

df = merge(df1,df2, by = 'ID')

What I would like to is essentially two steps:

  1. convert all the columns of a specific type (that start with "Context_" or "Disposition_") into their not hot-encoded form
  2. make the data set long format by meal number

Desired output:

ID | Sessions            | P_affect | N_affect | NMeals | MealNumber | MealObs | MealValue | Context | Disposition

1  | 2023-11-14 19:01:39 | 10       | 15       | 0      | Meal1      | Where   | NA        | Friends | Melancholic
    
1  | 2023-11-14 19:01:39 | 10       | 15       | 0      | Meal1      | Who     | NA        | Friends | Melancholic

I tried for step1:

df_modified = df %>%
  pivot_longer(col=starts_with("Context"), names_to="Context", names_prefix="Context_") %>% 
  filter(value==1) %>%
  select(-value) 

But this did not work very well, and also I would like an approach that just asks for column names and does the hot-encoding transformation for all of them, instead taking one by one. And for the long formatting:

data_long <- df %>%
  pivot_longer(cols = starts_with("Meal"),
               names_to = c("Meal Number", "Value"),
               names_sep = "_",
               values_to = "value")

Which works, but on a data set without hot-encoded values. I included a larger data frame just to check if the code works properly for all cases.


Solution

  • Your df dataset isn´t tidy because there´s more than one value per variable/observation. For ID == 2, Context is "Family", "Friends" and "Spouse". You´ll get list-columns:

    library(dplyr)
    
    # --------
    df_modified <- select(df, ID, starts_with("Context"))
    
    df_modified
    
      ID Context_Family Context_Friends Context_Spouse Context_Alone
    1  1              0               1              0             0
    2  2              1               1              1             0
    3  3              0               0              0             1
    

    Pivoting:

    df_modified <- pivot_longer(df_modified, -ID, names_to = "column", values_to = "value")
    
    # A tibble: 12 × 3
          ID column          value
       <dbl> <chr>           <dbl>
     1     1 Context_Family      0
     2     1 Context_Friends     1
     3     1 Context_Spouse      0
     4     1 Context_Alone       0
     5     2 Context_Family      1
     6     2 Context_Friends     1
     7     2 Context_Spouse      1
     8     2 Context_Alone       0
     9     3 Context_Family      0
    10     3 Context_Friends     0
    11     3 Context_Spouse      0
    12     3 Context_Alone       1
    
    #
    df_modified <- mutate(
      df_modified, 
      value = if_else(value == 1, str_extract(column, "(?<=_).*$"), NA_character_), 
      column = str_extract(column, "^.*(?=_)"))
    
    # A tibble: 12 × 3
          ID column  value  
       <dbl> <chr>   <chr>  
     1     1 Context NA     
     2     1 Context Friends
     3     1 Context NA     
     4     1 Context NA     
     5     2 Context Family 
     6     2 Context Friends
     7     2 Context Spouse 
     8     2 Context NA     
     9     3 Context NA     
    10     3 Context NA     
    11     3 Context NA     
    12     3 Context Alone
    
    #
    df_modified <- filter(df_modified, !is.na(value))
    
    # A tibble: 5 × 3
         ID column  value  
      <dbl> <chr>   <chr>  
    1     1 Context Friends
    2     2 Context Family 
    3     2 Context Friends
    4     2 Context Spouse 
    5     3 Context Alone 
    
    #
    df_modified <- pivot_wider(df_modified, names_from = column, values_from = value)
    
    Warning message:
    Values from `value` are not uniquely identified; output will contain list-cols.
    • Use `values_fn = list` to suppress this warning.
    • Use `values_fn = {summary_fun}` to summarise duplicates.
    • Use the following dplyr code to identify duplicates.
      {data} %>%
      dplyr::group_by(ID, column) %>%
      dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
      dplyr::filter(n > 1L) 
    > df_modified
    # A tibble: 3 × 2
         ID Context  
      <dbl> <list>   
    1     1 <chr [1]>
    2     2 <chr [3]>
    3     3 <chr [1]>
    

    EDIT: A alternative for list-columns is to summarise before the pivot_wider step:

    summarise(
      df_modified, .by = c(ID, column),
      value = str_flatten_comma(value, na.rm = TRUE))
    
    # A tibble: 3 × 3
       ID column  value                  
    <dbl> <chr>   <chr>                  
      1     1 Context Friends                
      2     2 Context Family, Friends, Spouse
      3     3 Context Alone 
    
    #
    df_modified <- pivot_wider(df_modified, names_from = column, values_from = value)
    
    # A tibble: 3 × 2
         ID Context                
      <dbl> <chr>                  
    1     1 Friends                
    2     2 Family, Friends, Spouse
    3     3 Alone