I have a data set with a column that contains multiple values, separated by a ;
.
name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming
I would like the create a dummy variable for each unique value in good_at
such each dummy variable contains a TRUE
or FALSE
to indicate whether or not that individual possess that particular value.
Drawing Cooking
True False
False True
False False
False False
To create dummy variables for each unique value in good_at
required the following steps:
good_at
into multiple rowsdummy::dummy()
- for each value in good_at
for each name
-sex
pairname
, sex
, key
and value
key
contains all the dummy variable column namesvalue
contains the values in each dummy variablevalue
is not zerokey
# load necessary packages ----
library(dummy)
library(tidyverse)
# load necessary data ----
df <-
read.table(text = "name sex good_at
1 Tom M Drawing;Hiking
2 Mary F Cooking;Joking
3 Sam M Running
4 Charlie M Swimming"
, header = TRUE
, stringsAsFactors = FALSE)
# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
df %>%
separate_rows(good_at, sep = ";")
# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
df_clean %>%
select(good_at) %>%
dummy() %>%
bind_cols(df_clean) %>%
# drop "good_at" column
select(-good_at) %>%
# make the tibble long by reshaping it into 4 columns:
# name, sex, key and value
# where key are the all dummy variable column names
# and value are the values in each dummy variable
gather(key, value, -name, -sex) %>%
# keep records where
# value is not equal to zero
# note: this is due to "Tom" having both a
# "good_at_Drawing" value of 0 and 1.
filter(value != 0) %>%
# make the tibble wide
# with one record per name-sex pair
# and as many columns as there are in key
# with their values from value
# and filling NA values to 0
spread(key, value, fill = 0) %>%
# for each name-sex pair
# cast the dummy variables into logical vectors
group_by(name, sex) %>%
mutate_all(funs(as.integer(.) %>% as.logical())) %>%
ungroup() %>%
# just for safety let's join
# the original "good_at" column
left_join(y = df, by = c("name", "sex")) %>%
# bring the original "good_at" column to the left-hand side
# of the tibble
select(name, sex, good_at, matches("good_at_"))
# view result ----
df_dummies
# A tibble: 4 x 9
# name sex good_at good_at_Cooking good_at_Drawing good_at_Hiking
# <chr> <chr> <chr> <lgl> <lgl> <lgl>
# 1 Char… M Swimmi… FALSE FALSE FALSE
# 2 Mary F Cookin… TRUE FALSE FALSE
# 3 Sam M Running FALSE FALSE FALSE
# 4 Tom M Drawin… FALSE TRUE TRUE
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
# good_at_Swimming <lgl>
# end of script #