I want to create a model matrix for a test dataset that is missing the response variable, such that multiplying it by the model coefficients exactly reproduces the results of calling predict() on the model. See the code below for an example.
I have code that can do this (again, see below), but it requires creating a placeholder response variable in my test data. This doesn't seem very clean, and I'm wondering whether there's a way to get the code to work without this workaround.
# Make data, fit model
set.seed(1); df_train = data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test = data.frame(x = rnorm(10), z = rnorm(10))
fit = lm(y ~ poly(x) + poly(z), data = df_train)
# Make model matrices. This gives an error for the test data because 'y' isn't found
mm_train = model.matrix(terms(fit), df_train)
mm_test = model.matrix(terms(fit), df_test) #"Error in eval(predvars, data, env) : object 'y' not found"
# Make fake y variable for test data then build model matrix. I want to know if there's a less hacky way to do this
df_test$y = 1
mm_test = model.matrix(terms(fit), df_test)
# Check that predict and matrix multiplication give identical results on the test data. NB this is not the case if constructing the model matrix using (e.g.) mm_test = model.matrix(formula(fit)[-2], df_test), for the reason outlined here: https://stackoverflow.com/questions/59462820/why-are-predict-lm-and-matrix-multiplication-giving-different-predictions
preds_1 = round(predict(fit, df_test), 5)
preds_2 = round(mm_test %*% fit$coefficients, 5)
all(preds_1 == preds_2) #TRUE
Building off of this question, you can extract the formula from the model, set the response to NULL, and pass that in to model.matrix:
mm_test = model.matrix(update(formula(fit), NULL ~ .), data = df_test)
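Still not "built-in" functionality, but at least this is a more concise one-liner. Note that a plain formula does not carry the "predvars" attribute stored in terms(fit), so data-dependent terms such as poly() are recomputed on the test data; for a model like the one in the question, the result will therefore generally differ from predict(), as with formula(fit)[-2].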
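A possible alternative, offered as a sketch rather than as part of the original answer: delete.response() from the stats package removes the response from the fitted terms object while keeping its "predvars" attribute, so poly() is evaluated with the coefficients estimated on the training data, which is what predict() does internally. The data and model are copied from the question; the name tt is just illustrative.

```r
# Same data and model as in the question
set.seed(1); df_train = data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
set.seed(2); df_test = data.frame(x = rnorm(10), z = rnorm(10))
fit = lm(y ~ poly(x) + poly(z), data = df_train)

# Drop the response from the fitted terms. The "predvars" attribute
# (the poly() basis estimated on the training data) is preserved.
tt = delete.response(terms(fit))

# No placeholder 'y' column is needed in df_test
mm_test = model.matrix(tt, df_test)

# Compare with predict(); all.equal() allows for floating-point noise
preds_1 = as.numeric(predict(fit, df_test))
preds_2 = as.numeric(mm_test %*% coef(fit))
all.equal(preds_1, preds_2)
```

Comparing with all.equal() instead of rounding and == avoids picking an arbitrary number of decimal places.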