Search code examples
rlistfor-loopdata.tablegam

Is there a R loop function (data.table) to run over 100s of `gam` results without exceeding the memory limit?


Spatial Interpolation using gam

Statement

I am hoping to get many spatial interpolation outputs using Generalised additive models (GAM). There are no problems for predicting a single pollution map, however, I need more than 100 maps. If possible I would like to automate the implementation and also get the results without exceeding the memory limit.


Spatial Interpolation process with GAM (mgcv package)

Just to let you know, here are the essential steps to get a interpolated map.

  • Get the X, Y coordinates of the pollution monitoring stations
  • Get the pollution data for each station
  • Add the pollution data to the data frame that contains X, Y coordinates
  • Run gam(pollution ~ s(X,Y, k=20)) for each pollution column
  • Create an empty dataframe with min and max X, Y coordinates as a spatial extent
  • Predict the spatial extent using predict and gam result
  • Run the same job over all pollution fields


I will show a hands-on example of how I approached it.


Sample data

To give an example, I created a dataset which is shown below. From the df, you would realise that I have X Y, and 3 pollution variables.

library(data.table)
library(mgcv)

X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
       190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
       444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)

df <- data.table(X, Y, poll1, poll2, poll3)


How I worked on it

1. Hard code

If you look at the code below, you would realised I copy&pasted the same job to all variables. This will be extremely hard to implement a lot of variables.

# Run gam
gam1 <- gam(poll1 ~ s(X,Y, k=20), data = df)
gam2 <- gam(poll2 ~ s(X,Y, k=20), data = df)
gam3 <- gam(poll3 ~ s(X,Y, k=20), data = df)
         # "there are over 5000 variables that needs looping


# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))


# Predict gam results to the empty surface
GAM_poll$gam1 <- predict(gam1, GAM_poll, type = "response")
GAM_poll$gam2 <- predict(gam2, GAM_poll, type = "response")
GAM_poll$gam3 <- predict(gam3, GAM_poll, type = "response")


2. Using for Loop

Instead, I made a list and attempted to loop all the variables to get a results. It certainly has no problem per se, but iterating over a multiple variables will take up all the memory (this is what I experienced).

# Run gam using list and for loop
myList <- list()

for(i in 3:length(df)){
  myList[[i-2]] <- gam(df[[i]] ~ s(X,Y, k=20), data = df)
}


# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))


# Predict gam results to the empty surface
myResult <- list()

for(j in 1:length(myList)){
myResult[[j]] <- predict(myList[[j]], GAM_poll, type = "response")
}

Asking for help

  • Is there a better way to get the gam results over multiple variables?
  • Is there a way to not exceed the memory limit during the implementation?

Can you help me data.table, purrr users?


Solution

  • The solution I created only keeps the latest prediction in memory and saves the others to disk before overwriting it with the next solution. The files are named after the column name of the model in a folder called results. I also melted the data.table, mostly because I think the code is a little clearer that way.

    library(data.table)
    library(mgcv)
    
    X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
           190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
    Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
           444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
    poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
    poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
    poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)
    
    df <- data.table(X, Y, poll1, poll2, poll3)
    
    
    # melt the data.table
    df <- melt.data.table(df, id.vars = c('X', 'Y'))
    
    dir.create('results')
    gam1 <- list()
    for(i in unique(df$variable)){
    
      gam1[[i]] <- gam(value ~ s(X,Y, k=20), data = df[variable == i])
    
      GAM_poll <- data.table(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                         Y = seq(min(df$Y), max(df$Y), length=200)))
    
    
      GAM_poll[, 'prediction' := predict(gam1[[i]], GAM_poll, type = "response")]
    
      write.csv(GAM_poll$prediction, paste('results/model_', i, '.csv'), row.names = FALSE)
    
    }