I have a small dataset of people, their locations, and whether or not they know each other. It is a subset of a dataset with 1000 people. Since each person could potentially know any other person, the number of potential links grows as n(n-1), just under n^2. I want to fit a model on the small subset to get the probability of linkage as a function of distance, so that I can run simulations with the wider dataset.
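For concreteness, with n = 1000, dropping the self-pairs leaves n(n-1) ordered pairs, which is exactly the number of rows the melted pair data below ends up with:
n = 1000
n*(n-1)      #999000 ordered pairs = rows after melting and dropping ID==knows
choose(n,2)  #499500 unordered pairs, if links were symmetric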
I've got two problems: predicting point estimates over all pairs of the full dataset is slow, and computing a prediction interval runs out of memory.
The following is a reproducible example of my problem, generating the interval using the technique from R-sig-mixed. Be warned that the last step will throw an error about not being able to allocate a huge vector, unless you're on a really impressive machine.
#Some fake location data
set.seed(13)
x = runif(50)*2
y = runif(50)*2
d = cbind(ID = 1:50,as.matrix(dist(data.frame(x,y)))) #an ID column plus the 50 x 50 pairwise distance matrix
I want to model links as a function of distance. More fake data:
library(reshape)
mdata <- melt(as.data.frame(d), id.vars="ID", measure.vars=colnames(d)[2:ncol(d)]) #long format: one row per (ID, potential acquaintance) pair
mdata$popularity = rnorm(50,sd=.3) #one popularity value per person, recycled across the melted rows
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
a = exp(1/(mdata$distance/runif(nrow(mdata))^mdata$distance)+mdata$popularity+rnorm(nrow(mdata),sd=.001)) #odds of a link: decays with distance, shifted by popularity, plus a little noise
mdata$prlink = a/(1+a) #convert odds to probability
with(mdata,plot(distance,prlink))
mdata$link = runif(nrow(mdata))<mdata$prlink
mdata$ID = as.factor(mdata$ID)
mdata$knows = as.factor(mdata$knows)
mdata$dum=1 #dummy covariate for the by= argument; this facilitates predicting from the population of the model, later
Now, I model the data:
library(mgcv)
#the s(...,bs="re") terms are random intercepts per person; by=dum lets them be switched off at prediction time
mod = gam(link~s(distance)+s(ID,bs="re",by=dum)+s(knows,bs="re",by=dum),data=mdata,family=binomial(link="logit"))
plot(mod,pages=1)
summary(mod)
Now, I want to apply the fitted model to my main dataset:
x = runif(1000)*2
y = runif(1000)*2
d = cbind(ID = 1:1000,as.matrix(dist(data.frame(x,y))))
mdata <- melt(as.data.frame(d),id.vars = "ID")
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
mdata$dum=0; mdata$ID=1; mdata$knows=2 #dum=0 zeroes out the random-effect terms; ID and knows still need to exist in newdata for predict() to run, even though I am predicting from the population of the model, not one of the levels.
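If you want to convince yourself that dum=0 really switches the random effects off, predicting a few rows term by term should show the s(ID):dum and s(knows):dum columns at exactly zero (just a sanity check, not part of the pipeline):
predict(mod, newdata=mdata[1:5,], type="terms") #only the s(distance) column should be nonzero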
Some timekeeping tools...
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
  type <- match.arg(type)
  assign(".type", type, envir=baseenv())
  if(gcFirst) gc(FALSE)
  tic <- proc.time()[type]
  assign(".tic", tic, envir=baseenv())
  invisible(tic)
}
toc <- function()
{
  type <- get(".type", envir=baseenv())
  toc <- proc.time()[type]
  tic <- get(".tic", envir=baseenv())
  print(toc - tic)
  invisible(toc)
}
tic()
p = predict(mod,newdata=mdata,type="response")
toc()
Just predicting the point estimates takes 31 seconds on my machine. Now to try to get the prediction intervals, first get the design matrix...
tic()
Designmat = predict(mod,newdata=mdata,type="lpmatrix")
toc()
That took me 47 seconds and froze my computer while it was working.
Now here is the technique for getting a prediction interval that I found on R-sig-mixed...
tic()
predvar <- diag(Designmat %*% vcov(mod) %*% t(Designmat)) #variance of the linear predictor for each row
SE <- sqrt(predvar)
SE2 <- sqrt(predvar+mod$sig2) #adds the residual variance, for a prediction (rather than confidence) interval
tfrac <- qt(0.975, mod$df.residual)
interval = tfrac*SE2 #half-width of the 95% interval
toc()
Error: cannot allocate vector of size 7435.7 Gb
Is there another way?
You need to avoid calculating Designmat %*% vcov(mod) %*% t(Designmat): you only need the diagonal, after all, while the full product is a 999,000 x 999,000 matrix (999,000^2 doubles is about 7,436 GB, which matches the allocation error above). Try this:
tmp <- Designmat %*% vcov(mod)
library(compiler)
diagMult <- cmpfun(function(m1, m2) sapply(seq_len(nrow(m1)),
                                           function(i) m1[i,] %*% m2[,i]))
predvar <- diagMult(tmp, t(Designmat))
(Not tested thoroughly. The function should be implemented with Rcpp to improve speed, if there is not already a compiled version in some package.)
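For what it's worth, the loop can also be avoided entirely: for matrices A and B of the same shape, diag(A %*% t(B)) equals rowSums(A * B), so the diagonal falls out of one elementwise product, and chunking the rows bounds memory at roughly chunk x ncol(Designmat) doubles. A sketch, reusing the Designmat and mod objects from above (the chunk size of 10000 is arbitrary; tune it to your RAM):
V <- vcov(mod)
chunk <- 10000 #arbitrary chunk size
predvar <- numeric(nrow(Designmat))
for (start in seq(1, nrow(Designmat), by=chunk)) {
  idx <- start:min(start+chunk-1, nrow(Designmat))
  Xi <- Designmat[idx, , drop=FALSE]
  predvar[idx] <- rowSums((Xi %*% V) * Xi) #diagonal of Xi %*% V %*% t(Xi)
}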