Why do we include -1 in the formula passed to the model.matrix function from the stats package?
training_matrix <- model.matrix(Survived ~ . - 1, data = training)
The standard Titanic dataset is used in this case.
There is also documentation that says one-hot encoding can be performed using model.matrix with the -1 notation, provided the factor and numeric columns in the dataset have been declared properly.
The code is as follows:
data_1_matrix <- model.matrix(~ . - 1, data = data_1)
What does this -1 do exactly?
The -1 removes the constant (the intercept column) from your model matrix. If you instead use
training_matrix <- model.matrix(Survived ~ ., data = training)
then a column of ones is included and one category is omitted from the model matrix, so that your model does not suffer from perfect multicollinearity.
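To make the difference concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its columns are hypothetical and stand in for the Titanic data from the question.

```r
# Hypothetical toy data, not the Titanic dataset from the question.
toy <- data.frame(
  age   = c(22, 38, 26, 35),
  class = factor(c("first", "second", "third", "first"))
)

# With the constant: a column of ones is added and the first level
# of `class` ("first") is omitted.
with_const <- model.matrix(~ ., data = toy)
colnames(with_const)
# "(Intercept)" "age" "classsecond" "classthird"

# With -1: no column of ones, and every level of `class` gets its own column.
no_const <- model.matrix(~ . - 1, data = toy)
colnames(no_const)
# "age" "classfirst" "classsecond" "classthird"
```

In `no_const` the three `class` columns sum to one in every row, which is exactly the column of ones that was removed. That is why keeping both the constant and all three indicator columns would be perfectly collinear.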
Which one is preferable is up to you: if you keep the constant, there will be a 'reference class' in your model, and the coefficients of the remaining categories are measured relative to it. If you drop it, there is no reference class.
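One caveat regarding one-hot encoding: when the data contain more than one factor, -1 gives a full set of indicator columns only for the first factor in the formula, and the later factors still lose their first level. The sketch below shows this on made-up data (`toy2` and its columns are hypothetical) and one possible way to get a column for every level, by passing identity contrasts through the `contrasts.arg` argument of model.matrix. It assumes every column of the data frame is a factor.

```r
# Hypothetical toy data with two factors.
toy2 <- data.frame(
  sex   = factor(c("female", "male", "male", "female")),
  class = factor(c("first", "second", "third", "first"))
)

# With -1 alone, only the first factor (`sex`) keeps all of its levels;
# `class` still drops its first level.
partial <- model.matrix(~ . - 1, data = toy2)
colnames(partial)
# "sexfemale" "sexmale" "classsecond" "classthird"

# Request identity (non-contrast) coding for every factor, so that each
# level of each factor gets its own indicator column.
full <- model.matrix(
  ~ . - 1,
  data = toy2,
  contrasts.arg = lapply(toy2, contrasts, contrasts = FALSE)
)
colnames(full)
# "sexfemale" "sexmale" "classfirst" "classsecond" "classthird"
```

The `full` matrix is rank deficient by construction, because the indicator columns of each factor sum to the same vector of ones. That is harmless for methods that tolerate redundant columns, but it is a problem for an unregularized linear model.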