Tags: r, spatial, gam, mgcv, autocorrelation

Proper GAM specification in the case of spatial autocorrelation


I am working to understand the effects of spatial predictors (e.g., tree canopy cover, impervious surface cover) on air temperature anomalies in cities. We, like others before us, do this with GAMs.

In both of these papers, spatial autocorrelation is "accounted for" by including a smooth term for longitude and latitude, such as s(lon,lat). The idea (as previously discussed here) is that this reduces spatial dependence among the residuals and also, perhaps, yields more realistic estimates of the fixed effects of interest.

I am using mgcv::gam and mgcv::gamm. I am having trouble deciding whether including this s(lon,lat) term is advisable or not. It has extremely high concurvity with fixed effects like elevation and distance_from_water. Moreover, it seems to really "mess up the fixed effect [I] love" (Hodges and Reich, 2010). For instance, there are visibly obvious cases where air temperature is altered by tree cover, but the effect is largely removed because tree canopy varies along a latitudinal gradient (by chance, let's say).
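For reference, this kind of concurvity check can be done with mgcv's concurvity() function; a minimal sketch (applying it to model #2 defined below is just my illustration):

    ## worst-case concurvity for each smooth; values near 1 mean the term is
    ## almost entirely reproducible from the other terms in the model
    concurvity(t_anom2, full = TRUE)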

Which of these three (somewhat simplified) models seems most appropriate, and what clues would you use to come to that conclusion? Note: n = 20,000 temperature measurements throughout a large city. We systematically subsample, modeling 1 to 5% of the points at a time, to take a first crack at reducing autocorrelation.

  1. Only fixed effects:

    t_anom1 <- gam(anomaly ~ s(tcf_90, k=3) + s(imp_90, k=3), data=data, method="REML")

  2. Fixed effects with s(lon, lat):

    t_anom2 <- gam(anomaly ~ s(tcf_90, k=3) + s(imp_90, k=3) + s(lon, lat, k=27), data=data, method="REML")

  3. Fixed effects in GAMM to allow modeling of the correlation structure in the residuals:

    t_anom3 <- gamm(anomaly ~ s(tcf_90, k=3) + s(imp_90, k=3), data=data, correlation=corExp(form = ~ lat + lon))

Where tcf_90 is tree canopy fraction within 90 meters and imp_90 is impervious surface fraction within 90 meters.

Results

Unless I'm misinterpreting, using BIC() on the three models suggests that #3 is far and away the "best" model:

                     df       BIC
t_anom1        5.987865 2781.9525
t_anom2       30.782511 1254.3167
t_anom3$lme    7.000000  237.0072

Here are the summaries of the three models:

Model #1 (t_anom1):

Family: gaussian 
Link function: identity 

Formula:
anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.84471    0.01189  -71.04   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df     F p-value    
s(tcf_90) 1.966      2 79.16  <2e-16 ***
s(imp_90) 1.889      2 97.20  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.338   Deviance explained =   34%
-REML = 1382.1  Scale est. = 0.26003   n = 1839

Model #2 (t_anom2):

Family: gaussian 
Link function: identity 

Formula:
anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3) + s(lon, lat, k = 27)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.844706   0.007512  -112.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
               edf Ref.df      F p-value    
s(tcf_90)   1.9182      2  78.63  <2e-16 ***
s(imp_90)   0.9935      2  72.69  <2e-16 ***
s(lon,lat) 25.3636     26 107.47  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.736   Deviance explained =   74%
-REML = 599.61  Scale est. = 0.10377   n = 1839

Model t_anom3:

Family: gaussian 
Link function: identity 

Formula:
anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.154    281.317  -0.004    0.997

Approximate significance of smooth terms:
            edf Ref.df      F  p-value    
s(tcf_90) 1.362  1.362  3.633   0.0522 .  
s(imp_90) 1.856  1.856 14.417 2.46e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.179   
  Scale est. = 79144     n = 1839

So we see huge differences in variance explained and in the significance of the smooths across models. Model #2 has a high r-squared because of the inclusion of the k=27 smoothed lon,lat surface, but more importantly its tcf and imp smooths are significant and (seemingly) useful, as opposed to model #3, where r-squared is very low and tcf is only marginally significant (p ≈ 0.05).

Finally, for each of these models' residuals, Global Moran's I (spdep::moran.test) was always > 0.6 and significant at the p < 0.00001 level. This is somewhat concerning, particularly for model #3. Semivariograms of the residuals (gstat::variogram) show that model #1 has spatial dependence out to about 3500 m, while model #2's semivariance plateaus at about 1500 m. Interestingly, model #3's semivariogram is exactly the same as model #1's, perhaps indicating that I'm doing something wrong in specifying the correlation structure.
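For completeness, the diagnostics above were run along these lines (a sketch: the k = 8 nearest-neighbour weights are an arbitrary illustrative choice; note also that residuals(t_anom3$lme, type = "normalized") would incorporate the corExp structure, while residuals(t_anom3$gam) would not, which may be the "something wrong" noted above):

    library(spdep)
    library(gstat)
    library(sp)

    ## Global Moran's I on the residuals, using k = 8 nearest-neighbour weights
    coords <- as.matrix(data[, c("lon", "lat")])
    lw <- nb2listw(knn2nb(knearneigh(coords, k = 8)))
    moran.test(residuals(t_anom1), lw)

    ## for the gamm fit, normalized residuals from the $lme component account
    ## for the corExp structure; raw residuals from t_anom3$gam do not
    moran.test(residuals(t_anom3$lme, type = "normalized"), lw)

    ## empirical semivariogram of model #1's residuals
    vg_dat <- data.frame(res = residuals(t_anom1), lon = data$lon, lat = data$lat)
    coordinates(vg_dat) <- ~ lon + lat
    plot(variogram(res ~ 1, vg_dat))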

Do I need to go with model #3 here because of BIC? Are there other diagnostics I should be aware of?


Solution

  • I don't think model 3 and models 1 & 2 are actually comparable via BIC; gam and gamm are probably calculating the "likelihood" of the data in very different ways. In general, I would not use AIC/BIC metrics to compare models estimated using different fitting criteria; they are only useful for comparing models estimated using the same function, on the same data set. That's because different modelling frameworks use different marginalizations of the likelihood function that may not be comparable between frameworks.

    For example, the code below shows the same model estimated in three ways: a GAM with a Gaussian process smoother (bs = "gp") using an exponential covariance function, estimated via the gam function; the same model estimated via gamm; and the same model estimated with gamm again, except with the spatial smoother moved into the correlation function:

    library(mgcv)
    ## simple examples using gamm as an alternative to gam
    set.seed(0)
    dat <- gamSim(1, n = 200, scale = 2)

    ## standard gam with a Gaussian process smoother using an exponential
    ## covariance function
    b_gam <- gam(y ~ s(x0, x1, bs = "gp", m = c(2, 1), k = 150) + s(x2) + s(x3),
                 data = dat, method = "REML")

    ## same model as above, but fit using gamm
    b_gamm <- gamm(y ~ s(x0, x1, bs = "gp", m = c(2, 1), k = 150) + s(x2) + s(x3),
                   data = dat)

    ## same as above, but estimating the correlation structure with corExp
    ## rather than a smoother
    b_gamm2 <- gamm(y ~ s(x2) + s(x3), data = dat,
                    correlation = corExp(form = ~ x0 + x1))

    BIC(b_gam, b_gamm$lme, b_gamm2$lme)
    

    gives you this BIC table:

                      df       BIC
    b_gam       28.37651  992.1030
    b_gamm$lme   9.00000  962.0534
    b_gamm2$lme  7.00000 1015.0635
    

    even though the estimated smooth functions are practically indistinguishable from one another:

    par(mfrow = c(3,3))
    plot(b_gam,scheme=2)
    plot(b_gamm$gam,scheme=2)
    plot(b_gamm2$gam,scheme=2)
    

    [Figure: smooths estimated by the three GAM approaches above]

    Similarly, R^2 is not directly comparable between the gam and gamm models: the gam model estimates a spatial smooth, so variation in space is treated as part of the variation to be explained, whereas in your gamm model spatial variation is modelled as part of the residual variation, and so is treated as "unexplained" variation.

    For your specific problem, there are a few options for including spatial autocorrelation when space is heavily collinear with your predictors. None of them is perfect, and this is still an area of active research, so I can't recommend a single perfect fix for your case; all I can recommend are possible approaches to try (see the sketches after this list).

    1. Limit the degrees of freedom of the spatial smooth (set it to something like k=10) so that it strictly models large-scale spatial trends (first sketch below). This can work well if you are interested in estimating the smaller-scale effects of varying tree cover etc. on temperature, and vegetation cover / impervious cover do not themselves follow very large-scale spatial gradients (if, say, the entire study area had a gradual gradient from low to high tree cover, that gradient would be absorbed by the spatial smooth).

    2. Do not include a spatial covariate at all, but reduce the effective sample size of the data (second sketch below). One effect of spatial autocorrelation on inference is that spatially autocorrelated data provide fewer effective degrees of freedom for estimating a model than the same number of statistically independent locations would. You can decrease the effective sample size in the gam() function by specifying the gamma argument: gamma = 2 cuts your effective sample size in half, gamma = 3 cuts it to a third, etc. The net result is less wiggly estimates for your main effects, without having to also estimate a spatial smoother. Unfortunately, I don't have any good guidelines on how to choose the effective sample size.

    3. Use block cross-validation to estimate your model (third sketch below). In this strategy, you spatially stratify your data (e.g. into city blocks, or larger regions), hold out each stratum in turn when fitting the model, then compare the model's predictions for the held-out stratum to its observed values. For instance, in this paper we re-estimated the same model for each 100th observation, then calculated average curves across all 100 separate models. A more standardized approach to block cross-validation has been implemented in mgcv via the "NCV" method for fitting GAMs, but that method requires you to specify the spatial blocking structure you want to use. It's a new approach, and one I am still learning about myself.
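    A minimal sketch of option 1, reusing the variable names from your question (the choice k = 10 is just an illustration):

    library(mgcv)

    ## spatial smooth capped at k = 10 so it can only absorb broad,
    ## large-scale trends, leaving finer-scale variation to the covariates
    m_lowk <- gam(anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3) + s(lon, lat, k = 10),
                  data = data, method = "REML")
    summary(m_lowk)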
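    A sketch of option 2, dropping the spatial term and instead inflating the smoothing penalty via gamma (the value 2, i.e. halving the effective sample size, is arbitrary):

    ## no spatial term; gamma = 2 behaves like halving the effective sample
    ## size, giving smoother (less wiggly) estimates of the covariate effects
    m_gamma <- gam(anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3),
                   data = data, method = "REML", gamma = 2)
    summary(m_gamma)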
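    And a hand-rolled sketch of option 3; the blocking here (k-means clustering of the coordinates into 10 blocks) is my own illustrative choice, not a recommendation:

    ## spatially stratify the data into 10 blocks via k-means on the coordinates
    data$block <- kmeans(data[, c("lon", "lat")], centers = 10)$cluster

    ## hold out each block in turn, refit, and score predictions on the
    ## held-out block
    cv_rmse <- sapply(sort(unique(data$block)), function(b) {
      train <- data[data$block != b, ]
      test  <- data[data$block == b, ]
      fit <- gam(anomaly ~ s(tcf_90, k = 3) + s(imp_90, k = 3),
                 data = train, method = "REML")
      sqrt(mean((test$anomaly - predict(fit, newdata = test))^2))
    })
    mean(cv_rmse)  # average out-of-block prediction error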