Say I have a dataframe with two columns like this:
Label 1 | Label 2 |
---|---|
A | B |
A | C |
B | C |
C | A |
The values A, B, and C in the first column are the same values as A, B, and C in the second column. I want the encoding to look like this:
Label 1 | Label 2 | is_A | is_B | is_C |
---|---|---|---|---|
A | B | 1 | 1 | 0 |
A | C | 1 | 0 | 1 |
B | C | 0 | 1 | 1 |
C | A | 1 | 0 | 1 |
Basically, I just want it to check if a value shows up in either column. If so, then code a 1, if not then code a 0.
Now, I know I could write this using if_else, like this:
library(dplyr)
df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
                    is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
                    is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))
but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:
encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)
but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?
In base R, you could try:
cbind(df, unclass(table(row(df), unlist(df))))
Label_1 Label_2 A B C
1 A B 1 1 0
2 A C 1 0 1
3 B C 0 1 1
4 C A 1 0 1
Another way:
cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))
Label_1 Label_2 A B C
1 A B 1 1 0
2 A C 1 0 1
3 B C 0 1 1
4 C A 1 0 1
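One caveat with the grepl approach is that it matches substrings, so a label such as A would also match a row containing AB. Here is a minimal sketch that tests for exact equality instead. It assumes the label columns are character vectors, rebuilds the example data as df, and adds the is_ prefix from the question. The names labels and ind are only illustrative.

```r
# Example data from the question (assumed to be character columns)
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"))

# All distinct labels that occur in either column
labels <- sort(unique(unlist(df)))

# For each label, flag rows where at least one column equals it exactly;
# the unary + converts the logical matrix to 0/1 integers
ind <- +sapply(labels, function(l) rowSums(df == l, na.rm = TRUE) > 0)
colnames(ind) <- paste0("is_", labels)

result <- cbind(df, ind)
print(result)
```

Because the comparison uses == rather than a regular expression, labels that contain regex metacharacters or that are substrings of other labels should not produce false matches.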
Note that for the table approach, you should use:
+unclass(table(row(df), unlist(df))>0)
This accounts for rows where the same value appears in both columns, which table would otherwise count as 2 instead of 1.
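To illustrate why the > 0 comparison matters, here is a small sketch with hypothetical data (df2 is not from the question) in which the first row repeats a label:

```r
# Hypothetical data: row 1 has "A" in both columns
df2 <- data.frame(Label_1 = c("A", "B"),
                  Label_2 = c("A", "C"))

# Raw counts per row and label: row 1 gets a 2 under "A"
counts <- unclass(table(row(df2), unlist(df2)))

# Thresholding at > 0 and converting with + gives 0/1 indicators
flags <- +(counts > 0)

print(counts)
print(flags)
```

Without the threshold, the encoded column would hold counts rather than indicators, which may matter if the result is fed to a model that expects 0/1 values.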
If you want to use model.matrix:
+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
valuesA valuesB valuesC
1 1 1 0
2 1 0 1
3 0 1 1
4 1 0 1