Make dummy variables for a categorical variable


Let's say I have a data frame df as follows:

df <- data.frame(type = c("A","B","AB","O","O","B","A"))

Here type has 4 distinct values. In my actual data, however, I don't know how many distinct values the type column contains. The number of dummy variables should be one less than the number of distinct values in type, so in this example there should be 3. My expected output looks like this:

df <- data.frame(type = c("A","B","AB","O","O","B","A"),
                 A = c(1,0,0,0,0,0,1),
                 B = c(0,1,0,0,0,1,0),
                 AB = c(0,0,1,0,0,0,0))

Here I used A, B and AB as dummy variables, but it doesn't matter which values of type I choose. Even without knowing the values of type or how many there are, I want to turn the column into dummy variables.


Solution

  • The number of dummy variables should be one less than the number of kinds in type.

    Here I used "A", "B" and "AB" as dummy variables, but whatever I choose from type doesn't matter.

    Even if I don't know the values in type and the number of kinds, I somehow want to make it as dummy variables.

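    What you describe is treatment contrast coding: a factor with k levels gets k - 1 indicator columns, and the first level acts as the baseline with no column of its own. First, you need a factor variable.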

    ## option 1: if you care about the order of the dummy variables
    ## the first level is the baseline and gets no dummy variable
    ## the levels are set here to match your example output with "A", "B" and "AB"
    f <- factor(df$type, levels = c("O", "A", "B", "AB"))
    
    ## option 2: if you don't care, let R order the levels automatically
    f <- factor(df$type)
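
If the values are not known in advance, it may help to inspect the levels R picked and, if one particular value should be the baseline, move it to the front with relevel(). A minimal sketch using the example df from the question (f2 is only an illustrative name):

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

## let R order the levels automatically (sorted)
f <- factor(df$type)
levels(f)     # "A" "AB" "B" "O"; "A" would be the baseline

## move "O" to the front so it becomes the baseline instead
f2 <- relevel(f, ref = "O")
levels(f2)    # "O" "A" "AB" "B"
```

With relevel() only the baseline has to be named; the remaining levels keep their existing order.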
    

    Now, apply treatment contrasts coding.

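    ## option 1 (recommended): using contr.treatment()
    ## indexing by the factor picks one row of the contrast matrix per observation;
    ## drop = FALSE keeps a matrix even when there are only 2 levels
    m <- contr.treatment(nlevels(f))[f, , drop = FALSE]
    
    ## option 2 (less efficient): using model.matrix()
    ## the first column is the intercept, so remove it
    m <- model.matrix(~ f)[, -1, drop = FALSE]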
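
As a quick sanity check, the two approaches can be compared on the example data. This is only a sketch; m1 and m2 are illustrative names, and unname() is used because the two results carry different row/column names before they are relabelled.

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))
f <- factor(df$type, levels = c("O", "A", "B", "AB"))

## approach 1: index the contrast matrix by the factor
m1 <- contr.treatment(nlevels(f))[f, , drop = FALSE]

## approach 2: build the design matrix and drop the intercept column
m2 <- model.matrix(~ f)[, -1, drop = FALSE]

## compare the 0/1 values only, ignoring row/column names
same <- isTRUE(all.equal(unname(m1), unname(m2)))
same       # TRUE
dim(m1)    # 7 rows, 3 columns
```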
    

    Finally, give the matrix readable row and column names.

    dimnames(m) <- list(seq_along(f), levels(f)[-1])
    

    The resulting m looks like:

    #   A  B  AB
    #1  1  0   0
    #2  0  1   0
    #3  0  0   1
    #4  0  0   0
    #5  0  0   0
    #6  0  1   0
    #7  1  0   0
    

    This is a matrix. If you want a data frame, use data.frame(m); pass check.names = FALSE if the level names are not syntactically valid R names.
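
Putting the steps together, a small helper might look like the sketch below. make_dummies is a hypothetical name, not a base R function. It assumes the input has at least two distinct values and lets R order the levels, so the baseline is whichever level sorts first.

```r
## hypothetical helper: treatment-coded dummy variables for a vector x
make_dummies <- function(x) {
  f <- factor(x)
  ## one row of the contrast matrix per observation; keep it a matrix
  m <- contr.treatment(nlevels(f))[f, , drop = FALSE]
  ## label rows by position and columns by the non-baseline levels
  dimnames(m) <- list(seq_along(f), levels(f)[-1])
  m
}

df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

## attach the dummy variables to the original data frame
out <- cbind(df, make_dummies(df$type))
names(out)    # "type" "AB" "B" "O"
```

Here "A" is the baseline because it sorts first; to pick a different baseline, relevel the factor before coding it.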