I hope this question isn't off topic. I know how to code a dummy variable in R; however, I was wondering if I could create it in Excel. Let's say I have 3 colors (red, blue, yellow) listed under a color variable. R would import this as a factor with 3 levels.
Now, if I wanted to do this in Excel, could I make 3 new predictors (instead of color they now become red, blue, and yellow) and place a 1 in the red column if it is red and 0 otherwise, and so on? Or will R interpret these as 3 individual factors with 2 levels each?
So you are manually creating three dummy columns in Excel and want to import them into R? As long as these columns are imported as numeric rather than factor, there will be no problem.
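As a rough sketch of that workflow, the snippet below reads a small CSV of the kind Excel might export and checks how R types the dummy columns. The data, the `response` column, and the `csv_text` name are made up for illustration. It also assumes you fit with an intercept, in which case only two of the three dummies should go in the formula, because the three columns always sum to 1 and the third is redundant.

```r
# Hypothetical CSV exported from Excel with manually created 0/1 columns.
csv_text <- "response,red,blue,yellow
1.2,1,0,0
1.5,1,0,0
2.1,0,1,0
2.4,0,1,0
3.3,0,0,1
3.0,0,0,1"

dat <- read.csv(text = csv_text)

# The 0/1 columns are read as integer (numeric), not as factors.
sapply(dat, class)

# With an intercept, include only two dummies; the omitted colour
# (red here) becomes the baseline captured by the intercept.
fit <- lm(response ~ blue + yellow, data = dat)
coef(fit)
```

If a dummy column does end up as a factor or character (for example because of stray text in a cell), converting it explicitly before fitting is one way out.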
Still, I should remind you that R can convert a factor to dummy variables itself, via model.matrix(), so there is never a need to do this yourself. It is definitely OK to use a single column with "red", "blue" and "yellow" in Excel, and import it into R as a factor.
colour <- gl(3,2,labels=c("red","blue","yellow"))
model.matrix(~ colour - 1)
#   colourred colourblue colouryellow
# 1         1          0            0
# 2         1          0            0
# 3         0          1            0
# 4         0          1            0
# 5         0          0            1
# 6         0          0            1
Just another quick question. Using model.matrix for the factor colour and other factor variables, how can I incorporate this into my model? When I call a linear model, for example lm(response ~ predictor.1 + predictor.2 + colour), will it automatically create the dummy variables, or do I need to assign the model.matrix to a vector?
model.matrix is a service routine for model fitting routines like lm, glm, etc. You can simply supply a formula, and the model matrix will be constructed behind the scenes. So you don't even need to obtain a model matrix yourself.
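To see this for yourself, here is a minimal sketch with made-up data. The variable names follow the formula in the question, and the values are random, so only the column names of the design matrix are of interest. Calling model.matrix() on the fitted object shows the matrix that lm built from the formula.

```r
set.seed(1)  # arbitrary seed; the data below are invented for illustration
colour      <- gl(3, 4, labels = c("red", "blue", "yellow"))
predictor.1 <- rnorm(12)
predictor.2 <- rnorm(12)
response    <- rnorm(12)

# Pass the factor directly; no manual dummy coding is needed.
fit <- lm(response ~ predictor.1 + predictor.2 + colour)

# Inspect the design matrix that lm constructed internally.
# "red" is the first level, so it is the baseline and gets no column.
colnames(model.matrix(fit))
```

Under the default treatment contrasts, the coefficients for colourblue and colouryellow are differences from the red baseline.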
Advanced users may sometimes want to call the internal fitting routines lm.fit or even .lm.fit. Read ?lm.fit for those routines. These routines do not accept a model formula, but a model matrix X and a response vector y. In that situation, you are fully responsible for generating X and y yourself.
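A small sketch of that lower-level route, again on invented data (the names x1, fit_low and fit_high are only for illustration): build X with model.matrix, hand it to lm.fit together with y, and compare against the usual formula interface.

```r
set.seed(1)  # arbitrary seed; data are invented for illustration
colour <- gl(3, 4, labels = c("red", "blue", "yellow"))
x1     <- rnorm(12)
y      <- rnorm(12)

# Build the model matrix by hand, including the dummy columns for colour.
X <- model.matrix(~ x1 + colour)

# Low-level fit: takes a matrix and a vector, not a formula.
fit_low <- lm.fit(X, y)

# High-level fit for comparison.
fit_high <- lm(y ~ x1 + colour)

# The two sets of coefficients should agree.
all.equal(fit_low$coefficients, coef(fit_high))
```

Note that lm.fit returns a plain list rather than an "lm" object, so helpers such as summary() will not give the familiar regression table.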