I have a number of CSV files with columns such as gender, age, diagnosis, etc.
Currently, they are coded as such:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
The goal is to transform this data into the target format below.
Side note: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female'.
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
I've attempted using reshape2's dcast() function, but it produces one column per combination of values, resulting in extremely sparse matrices. Here's a simplified example with just age and gender:
data.train <- dcast(data.raw, formula = ID ~ gender + age, fun.aggregate = length)
ID, male19, male23, male42, male61, female19, female23, female42, female61
1, 0, 0, 1, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 1
Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.
A base R option would be:
(!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
#   values
#ID  19 23 42 61 anxiety asthma copd diabetes female male
#  1  0  0  1  0       1      1    0        0      0    1
#  2  1  0  0  0       0      1    0        0      0    1
#  3  0  1  0  0       0      0    0        1      1    0
#  4  0  0  0  1       0      0    1        1      1    0
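Since the one-liner is fairly dense, here is a rough unpacking of the same steps. It is only a sketch of what the expression does, using the same example data as `df1`; the intermediate names `long`, `counts` and `indicators` are made up for illustration, and `counts > 0` is used in place of `!!` purely for readability.

```r
# Example data (same values as df1 in this answer)
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# 1. Stack all non-ID columns into one long 'values' column
#    (the integer ages are coerced to character alongside the strings).
long <- stack(df1[-1])

# 2. Pair every stacked value with its ID; the ID column is repeated
#    once per stacked column.
long$ID <- rep(df1$ID, times = ncol(df1) - 1)

# 3. Cross-tabulate ID against value: each cell counts occurrences.
counts <- table(ID = long$ID, values = long$values)

# 4. Collapse counts to 0/1 indicators, because an ID with several rows
#    (e.g. ID 1) yields counts greater than 1 for gender and age.
indicators <- (counts > 0) * 1L
indicators
```

Step 4 is the reason for the `!!` in the one-liner: without it, the cell for ID 1 and "male" would hold 2 rather than 1.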
If you need the original column names as prefixes as well:
(!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
# Val
#ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
#1 0 0 1 0 1 1
#2 1 0 0 0 0 1
#3 0 1 0 0 0 0
#4 0 0 0 1 0 0
# Val
#ID diagnosis_copd diagnosis_diabetes gender_female gender_male
#1 0 0 0 1
#2 0 0 0 1
#3 0 1 1 0
#4 1 1 1 0
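Note that the result of `table()` is a `table` object rather than a data frame. If a plain data frame with an explicit ID column is needed (for example, to write back to CSV), one possible sketch is below. The names `long`, `key`, `wide` and `out` are illustrative, and the output file name in the last comment is hypothetical.

```r
# Example data (same values as df1 in this answer)
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Build the 0/1 indicator table with "<column>_<value>" names
long <- stack(df1[-1])
key <- paste(long$ind, long$values, sep = "_")
wide <- (table(ID = rep(df1$ID, times = ncol(df1) - 1), Val = key) > 0) * 1L

# as.data.frame.matrix() keeps the wide layout; plain as.data.frame()
# on a table would return it in long format instead.
out <- as.data.frame.matrix(wide)

# Move the IDs from the row names into a regular column
out <- cbind(ID = as.integer(rownames(out)), out)
rownames(out) <- NULL
out

# write.csv(out, "encoded.csv", row.names = FALSE)  # hypothetical file name
```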
The example data used above:
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male",
"male", "male", "female", "female", "female"), age = c(42L, 42L,
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma",
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender",
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")