Tags: r, factors, dummy-variable

Converting factors to dummy variables in R


My data set contains information about smartphones. To fit a random forest, I need to convert my factor Brand into a set of dummy variables.

 I tried this code:

 m <- model.matrix( ~ Brand, data = data_price)

 (Intercept)  BrandApple  BrandAcer  BrandAlcatel ...
 1            0           0          1
 1            1           0          0
 ...

The problem is that the original data has 2039 rows, while the output has only 2038. I now want to add the dummies to data_price, but this doesn't work.

How can I create the dummy variables and add them to my data set?


Solution

  • Your approach using model.matrix should work fine; we only need to figure out what happened to the missing row. My guess is that your factor contains missing values. Consider the following:

    dat <- factor(mtcars$cyl)
    dat2 <- dat
    dat2[1] <- NA
    

    Here, I have taken a factor, namely the number of cylinders in the mtcars dataset, and for comparison I have created a second factor where I have replaced one value with NA. Let's look at the number of rows that model.matrix will spit out in each case:

    nrow(model.matrix(~dat))
    #> [1] 32
    nrow(model.matrix(~dat2))
    #> [1] 31
    

    As you can see, when the factor contains a missing value, the output of model.matrix has one row fewer. This is not surprising: model.matrix builds a model frame first, and R's default na.action (na.omit) drops every row that contains an NA.

    You can either create a separate factor level for the missing value, or drop the row with the missing value from your original data set, if that is appropriate for your application. The output of model.matrix keeps the original row names, which you can use to merge the dummies back onto the original data frame if you want to go down that route.
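
As a rough sketch of both routes, the snippet below first finds the dropped row by comparing row names, then tries each option in turn. The data frame `df` and its `Price` column are made up for illustration and only stand in for data_price; the helper column `BrandNA` is likewise an assumed name. `addNA()` and `merge()` are base R functions.

```r
# Hypothetical stand-in for data_price: Brand has one missing value
df <- data.frame(
  Brand = factor(c("Apple", "Acer", NA, "Alcatel", "Apple")),
  Price = c(700, 300, 250, 150, 650)
)

# Locate the row that model.matrix drops by comparing row names
m <- model.matrix(~ Brand, data = df)
dropped <- setdiff(rownames(df), rownames(m))

# Option 1: give NA its own factor level so that no row is dropped
df$BrandNA <- addNA(df$Brand)
m_na <- model.matrix(~ BrandNA, data = df)

# Option 2: keep the default NA handling and merge by row name;
# all.x = TRUE keeps the row with the missing Brand, with NA in the dummies
dummies <- as.data.frame(m[, -1, drop = FALSE])  # leave out the intercept
df_merged <- merge(df, dummies, by = "row.names", all.x = TRUE)
```

With option 2, `merge` adds a `Row.names` column and sorts on it as text, so the row order may differ from the original. If the row with the missing brand should be discarded instead, `all.x = TRUE` can be left out.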