I created a logistic regression model using the mlr3 package in R. I extracted the residuals from the model, but I can't work out how they have been calculated: they do not correspond to any residual calculation that I know of.
Suppose I use the mlr3 package to create a logistic regression model:
library(mlr3)
library(mlr3learners) # provides the "classif.log_reg" learner
library(tidyverse)
# create dummy data
set.seed(1) # for reproducibility of the dummy data
data <- data.frame(
  predictor = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  dependant = as.factor(c(rep(0, 50), rep(1, 50)))
)
# define and train a logistic regression model
classifier_log_reg <- mlr_learners$get("classif.log_reg")
task <- mlr3::TaskClassif$new(id = "my_data",
                              backend = data,
                              target = "dependant", # target variable
                              positive = "1")
classifier_log_reg$train(task, row_ids = 1:100)
I can get the residuals from the model using:
residuals <- classifier_log_reg$model$residuals
My question is: how are these residuals calculated? I cannot reproduce them manually. They don't match the numbers I get when I calculate Pearson or deviance residuals using the functions below:
pearson_residuals <- function(p, actual) {
  # Standard deviation of the predicted binomial distribution
  std_dev <- sqrt(p * (1 - p))
  # Avoid division by zero in case of p values being 0 or 1
  std_dev[std_dev == 0] <- .Machine$double.eps
  # Calculate the Pearson residuals
  residuals <- (actual - p) / std_dev
  return(residuals)
}
deviance_residuals <- function(p, actual) {
  # Ensure p is within valid range to avoid log(0) issues
  p <- ifelse(p == 0, .Machine$double.eps, ifelse(p == 1, 1 - .Machine$double.eps, p))
  # Calculate the deviance residuals
  residuals <- sign(actual - p) * sqrt(-2 * (actual * log(p) + (1 - actual) * log(1 - p)))
  return(residuals)
}
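For reference, this is roughly the comparison I ran; a minimal sketch (the probability extraction is essentially the same as the one I show in full further below):
classifier_log_reg$predict_type <- 'prob'
# predicted probability of the positive class ("1"), selected by column name
p <- classifier_log_reg$predict(task, row_ids = 1:100)$prob[, "1"]
# observed outcomes recoded as numeric 0/1
actual <- as.numeric(as.character(data$dependant))
comparison <- data.frame(
  model    = classifier_log_reg$model$residuals,
  pearson  = pearson_residuals(p, actual),
  deviance = deviance_residuals(p, actual)
)
head(comparison) # the model's residuals match neither manual calculation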
What I have found, strangely, is that the residuals from classifier_log_reg$model$residuals do appear to correspond systematically to the residuals I can calculate manually as the simple difference between the actual and predicted values of the dependant variable. Note that I have adjusted both my manual calculations and the residuals from the model object to best illustrate the apparent sigmoid relationship:
# get residuals directly from the model object
residuals <- classifier_log_reg$model$residuals
##### calculate residuals manually
# specify that predictions should be class probabilities rather than labels
classifier_log_reg$predict_type <- 'prob'
# get the predictions
predictions <- classifier_log_reg$predict(task, row_ids = 1:100)
# isolate the vector of predicted probabilities for the positive class
predictions <- predictions$data$prob %>% as.data.frame() %>% pull(1)
# subtract predictions from actual values of the dependant variable
actual <- data$dependant %>% as.character() %>% as.numeric()
my_resid <- actual - predictions
# put the residuals from the model, the manually calculated residuals
# and the actual values into a dataframe.
# I have adjusted them a bit to illustrate the (apparent) sigmoid relationship
# that emerges after these adjustments.
df <- data.frame(
  x = residuals - (actual * 2) + 1,
  y = (my_resid + 1) / 2,
  actual = actual
)
#plot the relationship between the manually calculated residuals (with adjustment)
#and the residuals straight from the model (with adjustment).
#The curve is completely smooth, but I cannot find the function linking x to y
ggplot(df) + geom_point(aes(x = x, y = y))
As can be seen, there seems to be a sigmoid relationship here. However, I have tried using the nls function to estimate the parameters of the best-fitting sigmoid curve linking x and y... and it doesn't fit well at all! The code below is what I tried; I have not pasted the plot it produces, but suffice to say that it does not show a straight line (which is what I would expect if the relationship between x and y really were sigmoid):
sigmoid <- function(x, L, k, x0) {
  L / (1 + exp(-k * (x - x0)))
}
model <- nls(y ~ sigmoid(x, L, k, x0),
             data = df,
             start = list(L = 1, k = 1, x0 = 1),
             control = nls.control(maxiter = 100))
df$fitted <- predict(model, df)
ggplot(df) + geom_point(aes(x = fitted, y = y))
So what IS the relationship between x and y here? And more to the point, how are the residuals from the mlr3 logistic regression model being calculated under the hood?
Probably the same way that the glm command calculates them:
glm1 <- glm(dependant ~ predictor, family = "binomial", data = data)
identical(residuals, glm1$residuals)
# [1] TRUE
And the type of residuals that glm calculates (and stores in $residuals) are the "working" residuals, as mentioned in the docs (see ?glm).
That is, working residuals are $r_i = z_i - \hat{\eta}_i$, where $z_i$ are the working responses and $\hat{\eta}_i$ is the linear predictor.
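Since for a binomial GLM with the logit link $d\mu/d\eta = \mu(1 - \mu)$, the working residuals can be reproduced directly from the fitted probabilities. A minimal sketch, using the glm1 object fitted above:
mu <- fitted(glm1) # fitted probabilities mu_i
y <- as.numeric(as.character(data$dependant)) # observed 0/1 outcomes
working <- (y - mu) / (mu * (1 - mu)) # (y - mu) / (dmu/deta)
all.equal(unname(working), unname(glm1$residuals)) # should be TRUE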