I have been running a GLMM using the glmmTMB package on a large dataset. Example data below:
structure(list(code = structure(c(1L, 1L, 1L, 1L), .Label = c("2388",
"2390", "12950", "12952", "12954", "12956", "12958", "12960",
"12962", "12964", "12966", "12968", "13573", "13574", "13575",
"13576", "13577", "14203", "19318", "19319", "19320", "19321",
"19322", "19515", "19517", "19523", "19524", "25534", "25535",
"25536", "25537", "25539", "25540", "25541", "25542", "25543",
"25544", "25545", "25546", "25547", "25548", "25549", "25550",
"25552", "25553", "27583", "27584", "27585", "27586", "27588",
"27589", "27590", "27591", "27592", "27593", "27594", "27595",
"27596", "27597", "27598", "27599", "27600", "27601", "27602",
"27603", "27604", "27605", "27606", "27607", "27608", "27609",
"27610", "27611", "27613", "27614", "27615", "27616", "27617",
"27618", "27619", "27620", "27621", "27622", "27624", "27625",
"27626", "27627", "27629", "27630", "27631", "27632", "34176",
"34177", "34178", "34179", "52975", "52977", "52978", "54814",
"54815", "54816", "54817", "54821", "54822", "54823", "54824",
"54825", "54835", "54837", "54838", "54839", "54840", "54841",
"54842", "54843", "54844", "54845", "54846", "54847", "54848",
"54849", "54851", "54852", "54853", "54856", "54858", "54859",
"54860", "54863", "54864", "54866", "54869", "54872", "54873",
"54874", "54875", "54876", "54877", "54878", "54880", "54882",
"54883", "54884", "54886", "54887", "54889", "54890", "54892",
"54893", "54895", "54896", "54898", "54899", "54900", "54901",
"54903", "54904", "54905", "54906", "54911", "54912", "54914",
"54915", "54931", "54933", "54934", "54935", "54937", "54939",
"54940", "54941", "54942", "54943", "54944", "54945", "54946",
"54947", "54948", "54950", "54952", "54954", "54955", "54957",
"54958", "54959", "54961", "54962"), class = "factor"), station =
c("PB14",
"PB14", "PB16", "PB16"), species = c("Silvertip Shark", "Silvertip Shark",
"Silvertip Shark", "Silvertip Shark"), sex = c("F", "F", "F",
"F"), size = c(112, 112, 112, 112), datetime = c("1466247120",
"1466247420", "1467026100", "1469621400"), year = c("2016", "2016",
"2016", "2016"), month = c(6, 6, 6, 7), hour = c("11", "11",
"12", "13"), season = c("dry season", "dry season", "dry season",
"dry season"), daynight = c("day", "day", "day", "day"), time_diff = c(4,
5, 5821, 43255), offshore = structure(c(2L, 2L, 1L, 1L), .Label =
c("offshore",
"onshore"), class = "factor"), rowN = 1:4), row.names = c(NA,
4L), class = "data.frame")
I'm looking to fit the model on 80% of my data and then validate it on the remaining 20% using the predict() function, with the code below:
library(dplyr)    # for sample_frac() and anti_join()
library(glmmTMB)
Off_80 <- Off %>% sample_frac(0.80)
Off_20 <- anti_join(Off, Off_80, by = "rowN")
OffMod_80 <- glmmTMB(offshore ~ sex + log(size) + species*daynight + species*season + (1|code),
                     family = binomial(), data = Off_80)
pred_Off20 <- as.data.frame(predict(OffMod_80, newdata = Off_20, type = "response", allow.new.levels = TRUE))
The idea is that I would then compare the predicted results to my observed results to validate the model. However, rather than getting an 'offshore' or 'onshore' response, I get numerical values:
predict()
1 0.2807461
2 0.2631816
3 0.2631816
4 0.2807461
Is there any way to get the predict function to spit out a binomial response? Or how do I interpret these values as binomial? Initially I had my response variable as a 1 or a 0, but following this post I changed the values to factors; predict() still returns a numerical value.
Any help appreciated!
predict() for binomial models returns the predicted probability of a success; it does not return 1 or 0, because the outcome can only be predicted with a certain probability. So if you want to check your model's performance, you can calculate the area under the ROC curve (AUC):
library(glmmTMB)
library(pROC)

data(mtcars)
n <- nrow(mtcars)
# 80/20 train/test split; sample without replacement so no row ends up duplicated
train <- sample(1:n, floor(n * 0.8), replace = FALSE)
mtcars_train <- mtcars[train, ]
mtcars_test <- mtcars[-train, ]

m <- glmmTMB(formula = vs ~ hp + wt + (1 | gear), family = binomial, data = mtcars_train)
# AUC is rank-based, so link-scale predictions (the default) work as well as probabilities
p <- predict(m, newdata = mtcars_test)
auc(roc(response = mtcars_test$vs, predictor = p))
#> Area under the curve: 0.9107
Created on 2019-05-29 by the reprex package (v0.3.0)
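As a side note on your original question of getting a binomial response: if you do want hard class predictions, a common approach is to threshold the predicted probabilities. A sketch, reusing m and mtcars_test from the example above; the 0.5 cutoff is an assumed default (pROC's coords() can suggest a data-driven one):
# probabilities on the response scale, then a hard 0/1 cut at 0.5
p_prob <- predict(m, newdata = mtcars_test, type = "response")
pred_class <- ifelse(p_prob > 0.5, 1, 0)
# confusion table: observed vs. predicted classes
table(observed = mtcars_test$vs, predicted = pred_class)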
Another way is to compare the percentage of correct predictions (PCP) of the full model with that of the null model (see Herron, M. (1999). Postestimation Uncertainty in Limited Dependent Variable Models. Political Analysis, 8, 83–98). The PCP of the full model should be noticeably higher:
library(glmmTMB)
library(insight)
data(mtcars)
m <- glmmTMB(formula = vs ~ hp + wt + (1 | gear), family = binomial, data = mtcars)
m0 <- glmmTMB(formula = vs ~ 1 + (1 | gear), family = binomial, data = mtcars)
y <- insight::get_response(m)
y0 <- insight::get_response(m0)
n <- nobs(m)
n0 <- nobs(m0)
p <- predict(m, type = "response")
p0 <- predict(m0, type = "response")
# PCP: mean predicted probability assigned to each observation's observed outcome
pcp_full <- (sum(1 - p[y == 0]) + sum(p[y == 1])) / n
pcp_null <- (sum(1 - p0[y0 == 0]) + sum(p0[y0 == 1])) / n0
# percentage correct predictions full model
pcp_full
#> [1] 0.8374718
# percentage correct predictions null model
pcp_null
#> [1] 0.6614221
Created on 2019-05-29 by the reprex package (v0.3.0)
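The PCP is just the average predicted probability assigned to each observation's actual outcome. A small convenience wrapper (my own sketch, not from the cited paper) makes that equivalence explicit, reusing y, p, y0 and p0 from above:
# pcp(): mean of p where y == 1 and of (1 - p) where y == 0;
# identical to the sum-based formula above
pcp <- function(y, p) mean(ifelse(y == 1, p, 1 - p))
pcp(y, p)   # matches pcp_full
pcp(y0, p0) # matches pcp_null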
Finally, if you really want responses of 0 and 1, you can use simulate(), which returns values on the same scale as the original response. Then you can compare how many of the simulated responses match the actual responses:
library(glmmTMB)
data(mtcars)
m <- glmmTMB(formula = vs ~ hp + wt + (1 | gear), family = binomial, data = mtcars)
# simulate one response vector from the fitted model; the simulated
# column is a two-column (successes, failures) matrix, so take the
# first column (= successes)
s <- simulate(m)$sim_1[, 1]
# proportion of response values that equal simulated responses
mean(mtcars$vs == s)
#> [1] 0.875
Created on 2019-05-29 by the reprex package (v0.3.0)
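A single simulation is noisy, so you could also average the match rate over many simulated responses. A sketch, assuming the same two-column (successes, failures) structure per simulated column as above:
# draw 100 simulated response vectors and average the per-draw match rate
sims <- simulate(m, nsim = 100)
match_rates <- vapply(sims, function(s) mean(mtcars$vs == s[, 1]), numeric(1))
mean(match_rates)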