Search code examples
rdataframedplyrdata-manipulationone-hot-encoding

Decoding multiple hot-encoded columns efficiently in R


I have the following data frame:

id = c(1,2,3)

where_home = c(1, 0, NA)
where_work = c(0, 1, NA)

with_alone = c(0,0,0)
with_parents = c(0,1,1)
with_colleagues = c(1,1,0)

gender_male = c(1,0,1)
gender_female = c(0,1,0)

p_affect = c(10,14,20)
n_affect = c(20,30,10)


df = data.frame(id, where_home, where_work,
                with_alone, with_parents, with_colleagues,
                gender_male, gender_female, p_affect, n_affect)

Where there are 3 IDs, and multiple hot-encoded columns (where, with, gender) along with not hot-encoded columns (p_affect, n_affect).

What I would like is to convert the hot-encoded columns while keeping the not hot-encoded ones as they are.

I did the following:

library(dplyr)

df_transformed <- df %>%
  rowwise() %>%
  mutate(Gender = case_when(
    gender_male == 1 ~ "Male",
    gender_female == 1 ~ "Female",
    TRUE ~ NA_character_
  ),
  Context = paste(
    ifelse(with_alone == 1, "Alone", ""),
    ifelse(with_parents == 1, "Parents", ""),
    ifelse(with_colleagues == 1, "Colleagues", ""),
    collapse = " and "
  ),
  Location = trimws(ifelse(
    where_home == 1 & where_work == 1, 
    'Home and Work', 
    paste(
      ifelse(where_home == 1, 'Home', ''),
      ifelse(where_work == 1, 'Work', '')
    )
  ))) %>%
  select(-starts_with("gender_"), -starts_with("with_"))

df_transformed <- df_transformed %>%
  select(id, Gender, Context, Location, p_affect, n_affect)

And the result:

     id Gender Context               Location p_affect n_affect
  <dbl> <chr>  <chr>                 <chr>       <dbl>    <dbl>
1     1 Male   "  Colleagues"        Home           10       20
2     2 Female " Parents Colleagues" Work           14       30
3     3 Male   " Parents "           NA             20       10

This seems to work, but there are a few issues:

  • some of the spacing seems strange in the 'Context' column. I would prefer a cleaner format without any spaces separated by 'and' (e.g. "Parents and Colleagues" instead of " Parents Colleagues"
  • in this approach, I need to define each column and each case separately, which is tedious, as the original data frame is massive with many columns and possible options. I would like something like:
pseudocode:

vector_of_columns_that_are_hot_encoded = c('where', 'with', 'gender')
for column in vector_of_columns:
 # modify the hot-encoded columns and make a new data frame while keeping the columns that are not in the vector_of_columns_that_are_hot_encoded as they are
# mind that some hot-encoded columns are binary (gender), while others have multiple values. If multiple values are present, put them in the data frame using "Value 1 and Value 2 and ..."

I think there has to be a simple way of doing this. As I am a beginner with dplyr, if possible to explain the code and keep it simple.


Solution

  • If the names are that nicely formatted as shown, you could strsplit at '_' and exploit pre and suffixes in by.

    > cols <- 2:8
    > by(names(df)[cols], sapply(strsplit(names(df)[cols], '_'), `[`, 1), \(x) {
    +   sb <- sub('.*_', '', x)
    +   apply(df[, x] == 1, 1, \(x) toString(unique(sb[x])))
    + }) |> do.call(what='cbind') |> cbind(df[-(cols)])
      gender where                with id p_affect n_affect
    1   male  home          colleagues  1       10       20
    2 female  work parents, colleagues  2       14       30
    3   male    NA             parents  3       20       10
    

    Explanation: The strsplit splits the selected variables at the "_" and the sapply around it selects the 1st element, so the (complete) variable names are split by their prefixes. The sub only leaves the suffix, so the comparison with 1 in apply selects the correct endings, a little nice with a comma using toString. Next we cbind first the created selections and finally everything to the rest of df.


    Data:

    > dput(df)
    structure(list(id = c(1, 2, 3), where_home = c(1, 0, NA), where_work = c(0, 
    1, NA), with_alone = c(0, 0, 0), with_parents = c(0, 1, 1), with_colleagues = c(1, 
    1, 0), gender_male = c(1, 0, 1), gender_female = c(0, 1, 0), 
        p_affect = c(10, 14, 20), n_affect = c(20, 30, 10)), class = "data.frame", row.names = c(NA, 
    -3L))