r, r-factor

Creating a factor/categorical variable from 4 dummies


I have a data frame with ten observations and four columns, call them V1-V4. Exactly one of V1-V4 is 1 in each row, and the others are 0. I want to create a new column called NEWCOL that is 3 if V3 is 1, 4 if V4 is 1, and 0 otherwise.

I have to do this for MANY sets of variables V1-V4 so I would like the solution to be as short as possible so that it will be easy to replicate.


Solution

  • This adds a fifth column to the four existing ones using matrix multiplication:

    > cbind( mydf, newcol=data.matrix(mydf) %*% c(0,0,3,4) )
       V1 V2 V3 V4 newcol
    1   1  0  0  0      0
    2   1  0  0  0      0
    3   0  1  0  0      0
    4   0  1  0  0      0
    5   0  0  1  0      3
    6   0  0  1  0      3
    7   0  0  0  1      4
    8   0  0  0  1      4
    9   0  0  0  1      4
    10  0  0  0  1      4
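
    The output above can be reproduced with a small self-contained sketch. The contents of mydf are an assumption reconstructed from the printed rows, and `weights` is just an illustrative name. Because each row contains a single 1, the product picks out exactly one weight per row.

    ```r
    # Rebuild the example data (assumed layout: one indicator per row, ten rows).
    mydf <- data.frame(
      V1 = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
      V2 = c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
      V3 = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
      V4 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
    )

    # One weight per original column: V1 and V2 map to 0, V3 to 3, V4 to 4.
    weights <- c(0, 0, 3, 4)
    newcol <- as.vector(data.matrix(mydf) %*% weights)

    # Equivalent elementwise form, to show what the product computes.
    stopifnot(all(newcol == 3 * mydf$V3 + 4 * mydf$V4))
    ```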
    

    This generalizes to building several new columns at once; we just need the rules. Make a matrix with as many rows as there are columns in the original data and one column of multipliers (weights) for each new variable. This shows how to build one new column as 3 times the third column plus 4 times the fourth, and another new column as 1 times the first plus 2 times the second.

    > cbind( mydf, newcol=data.matrix(mydf) %*% matrix(c(0,0,3,4,  # first set of factors
                                                         1,2,0,0), # second set
                                                       ncol=2) )
       V1 V2 V3 V4 newcol.1 newcol.2
    1   1  0  0  0        0        1
    2   1  0  0  0        0        1
    3   0  1  0  0        0        2
    4   0  1  0  0        0        2
    5   0  0  1  0        3        0
    6   0  0  1  0        3        0
    7   0  0  0  1        4        0
    8   0  0  0  1        4        0
    9   0  0  0  1        4        0
    10  0  0  0  1        4        0
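
    Since the question mentions repeating this for many sets of V1-V4, one possible way to keep it short is to wrap the multiplication in a helper and loop over the sets. This is only a sketch: `collapse_dummies`, the column names A1-A4 and B1-B4, and the `sets` list are hypothetical and not part of the original answer.

    ```r
    # Hypothetical helper: collapse one set of dummy columns into a single code.
    # `cols` and `weights` are illustrative argument names.
    collapse_dummies <- function(df, cols, weights = c(0, 0, 3, 4)) {
      stopifnot(length(cols) == length(weights))
      as.vector(data.matrix(df[cols]) %*% weights)
    }

    # Assumed example data with two sets of dummies, A1-A4 and B1-B4.
    dat <- data.frame(
      A1 = c(1, 0, 0, 0), A2 = c(0, 1, 0, 0), A3 = c(0, 0, 1, 0), A4 = c(0, 0, 0, 1),
      B1 = c(0, 0, 0, 1), B2 = c(0, 0, 1, 0), B3 = c(0, 1, 0, 0), B4 = c(1, 0, 0, 0)
    )

    # Map each new column name to the dummy columns it is built from.
    sets <- list(NEWCOL_A = paste0("A", 1:4), NEWCOL_B = paste0("B", 1:4))
    for (nm in names(sets)) {
      dat[[nm]] <- collapse_dummies(dat, sets[[nm]])
    }

    # Optional: store a result as a factor, as the question title asks.
    dat$NEWCOL_A <- factor(dat$NEWCOL_A, levels = c(0, 3, 4))
    ```

    The weights default to c(0, 0, 3, 4) to match the question; pass a different vector if a set needs other codes.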