gam
Statement
I am hoping to get many spatial interpolation outputs using Generalised additive models (GAM). There are no problems for predicting a single pollution map, however, I need more than 100 maps. If possible I would like to automate the implementation and also get the results without exceeding the memory limit.
Spatial Interpolation process with GAM (mgcv
package)
Just to let you know, here are the essential steps to get a interpolated map.
gam(pollution ~ s(X,Y, k=20))
for each pollution columnmin
and max
X, Y coordinates as a spatial extentpredict
and gam
resultI will show a hands-on example of how I approached it.
To give an example, I created a dataset which is shown below. From the df
, you would realise that I have X
Y
, and 3 pollution variables.
library(data.table)
library(mgcv)
X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)
df <- data.table(X, Y, poll1, poll2, poll3)
1. Hard code
If you look at the code below, you would realised I copy&pasted the same job to all variables. This will be extremely hard to implement a lot of variables.
# Run gam
gam1 <- gam(poll1 ~ s(X,Y, k=20), data = df)
gam2 <- gam(poll2 ~ s(X,Y, k=20), data = df)
gam3 <- gam(poll3 ~ s(X,Y, k=20), data = df)
# "there are over 5000 variables that needs looping
# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
Y = seq(min(df$Y), max(df$Y), length=200)))
# Predict gam results to the empty surface
GAM_poll$gam1 <- predict(gam1, GAM_poll, type = "response")
GAM_poll$gam2 <- predict(gam2, GAM_poll, type = "response")
GAM_poll$gam3 <- predict(gam3, GAM_poll, type = "response")
2. Using for
Loop
Instead, I made a list and attempted to loop all the variables to get a results. It certainly has no problem per se, but iterating over a multiple variables will take up all the memory (this is what I experienced).
# Run gam using list and for loop
myList <- list()
for(i in 3:length(df)){
myList[[i-2]] <- gam(df[[i]] ~ s(X,Y, k=20), data = df)
}
# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
Y = seq(min(df$Y), max(df$Y), length=200)))
# Predict gam results to the empty surface
myResult <- list()
for(j in 1:length(myList)){
myResult[[j]] <- predict(myList[[j]], GAM_poll, type = "response")
}
gam
results over multiple variables?Can you help me data.table
, purrr
users?
The solution I created only keeps the latest prediction in memory and saves the others to disk before overwriting it with the next solution. The files are named after the column name of the model in a folder called results. I also melted the data.table, mostly because I think the code is a little clearer that way.
library(data.table)
library(mgcv)
X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)
df <- data.table(X, Y, poll1, poll2, poll3)
# melt the data.table
df <- melt.data.table(df, id.vars = c('X', 'Y'))
dir.create('results')
gam1 <- list()
for(i in unique(df$variable)){
gam1[[i]] <- gam(value ~ s(X,Y, k=20), data = df[variable == i])
GAM_poll <- data.table(expand.grid(X = seq(min(df$X), max(df$X), length=200),
Y = seq(min(df$Y), max(df$Y), length=200)))
GAM_poll[, 'prediction' := predict(gam1[[i]], GAM_poll, type = "response")]
write.csv(GAM_poll$prediction, paste('results/model_', i, '.csv'), row.names = FALSE)
}