I have the following data frame:
id = c(1,2,3)
where_home = c(1, 0, NA)
where_work = c(0, 1, NA)
with_alone = c(0,0,0)
with_parents = c(0,1,1)
with_colleagues = c(1,1,0)
gender_male = c(1,0,1)
gender_female = c(0,1,0)
p_affect = c(10,14,20)
n_affect = c(20,30,10)
df = data.frame(id, where_home, where_work,
with_alone, with_parents, with_colleagues,
gender_male, gender_female, p_affect, n_affect)
Where there are 3 IDs, and multiple hot-encoded columns (where, with, gender) along with not hot-encoded columns (p_affect, n_affect).
What I would like is to convert the hot-encoded columns while keeping the not hot-encoded ones as they are.
I did the following:
library(dplyr)
df_transformed <- df %>%
rowwise() %>%
mutate(Gender = case_when(
gender_male == 1 ~ "Male",
gender_female == 1 ~ "Female",
TRUE ~ NA_character_
),
Context = paste(
ifelse(with_alone == 1, "Alone", ""),
ifelse(with_parents == 1, "Parents", ""),
ifelse(with_colleagues == 1, "Colleagues", ""),
collapse = " and "
),
Location = trimws(ifelse(
where_home == 1 & where_work == 1,
'Home and Work',
paste(
ifelse(where_home == 1, 'Home', ''),
ifelse(where_work == 1, 'Work', '')
)
))) %>%
select(-starts_with("gender_"), -starts_with("with_"))
df_transformed <- df_transformed %>%
select(id, Gender, Context, Location, p_affect, n_affect)
And the result:
id Gender Context Location p_affect n_affect
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 Male " Colleagues" Home 10 20
2 2 Female " Parents Colleagues" Work 14 30
3 3 Male " Parents " NA 20 10
This seems to work, but there are a few issues:
pseudocode:
vector_of_columns_that_are_hot_encoded = c('where', 'with', 'gender')
for column in vector_of_columns:
# modify the hot-encoded columns and make a new data frame while keeping the columns that are not in the vector_of_columns_that_are_hot_encoded as they are
# mind that some hot-encoded columns are binary (gender), while others have multiple values. If multiple values are present, put them in the data frame using "Value 1 and Value 2 and ..."
I think there has to be a simple way of doing this. As I am a beginner with dplyr, if possible to explain the code and keep it simple.
If the names are that nicely formatted as shown, you could strsplit
at '_'
and exploit pre and suffixes in by
.
> cols <- 2:8
> by(names(df)[cols], sapply(strsplit(names(df)[cols], '_'), `[`, 1), \(x) {
+ sb <- sub('.*_', '', x)
+ apply(df[, x] == 1, 1, \(x) toString(unique(sb[x])))
+ }) |> do.call(what='cbind') |> cbind(df[-(cols)])
gender where with id p_affect n_affect
1 male home colleagues 1 10 20
2 female work parents, colleagues 2 14 30
3 male NA parents 3 20 10
Explanation: The strsplit
splits the selected variables at the "_"
and the sapply
around it selects the 1
st element, so the (complete) variable names are split by
their prefixes. The sub
only leaves the suffix, so the comparison with 1
in apply
selects the correct endings, a little nice with a comma using toString
. Next we cbind
first the created selections and finally everything to the rest of df
.
Data:
> dput(df)
structure(list(id = c(1, 2, 3), where_home = c(1, 0, NA), where_work = c(0,
1, NA), with_alone = c(0, 0, 0), with_parents = c(0, 1, 1), with_colleagues = c(1,
1, 0), gender_male = c(1, 0, 1), gender_female = c(0, 1, 0),
p_affect = c(10, 14, 20), n_affect = c(20, 30, 10)), class = "data.frame", row.names = c(NA,
-3L))