
Jackknife (leave-one-out) cross-validation with tidysdm and the spatialsample package in R


I am working with a very low sample size (10-15) and trying to create an ensemble SDM with the tidysdm package. Random k-fold cross-validation won't work with so few points; for sample sizes this low, jackknife (leave-one-out) cross-validation is best practice.

tidysdm uses the spatialsample package, and I think the jackknife function is "spatial_leave_location_out_cv". However, it requires a "group" argument, and I cannot figure out what to supply for it.

Right now I have an sf data frame with the following columns: Class (presence vs. background locations); Geometry (the lat/lon location data; 10010 rows in total, 10 presences and 10000 background points); and 7 predictor variables with values extracted from the rasters.

I have tried supplying the class and geometry columns as the group argument. The models failed to run with class as the group. When I set group to geometry, the models ran for about 8 hours and weren't even halfway through, so I terminated the session.

What is the proper way to run this function for jackknife CV? Here is my code if it helps:

```
dive.cv <- spatial_leave_location_out_cv(data = dive.vars1, group = jackknife, v = NULL)

autoplot(dive.cv)
```

Solution

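  • To run a leave-one-out jackknife with spatial_leave_location_out_cv, you need to set up a grouping variable such that each group contains a single presence and an appropriate number of background points. With v = NULL (the default), the function builds one fold per unique value of the grouping variable, so every group is held out exactly once. Grouping by geometry therefore asks for one fold per unique point, which would explain the very long run time. Here is a simple reprex using the lacerta dataset in tidysdm, subset to just 3 presences and 2 background points per presence.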

    library(tidysdm)
    #> Loading required package: tidymodels
    #> Loading required package: spatialsample
    lacerta_thin <- readRDS(system.file("extdata/lacerta_thin_all_vars.rds",
                                        package = "tidysdm"))
    ########
    # create a small dataset for the reprex
    n_pres <- 3 # number of presences
    n_bkg_per_pres <- 2 # number of background points per presence
    set.seed(123)
    lacerta_small <- rbind(
      lacerta_thin %>% filter(class == "presence") %>%
        sample_n(size = n_pres),
      lacerta_thin %>% filter(class == "background") %>%
        sample_n(size = n_pres * n_bkg_per_pres)
    )
    ########
    # now create groups, 1 per presence, each with 
    # n_bkg_per_pres background points
    lacerta_small$group <- NA
    lacerta_small$group[lacerta_small$class == "presence"] <- 1:n_pres
    lacerta_small$group[lacerta_small$class == "background"] <- 
      sample(rep(1:n_pres, each = n_bkg_per_pres), replace=FALSE)
    ########
    # set up the folds for the jackknife
    lacerta_cv <- spatial_leave_location_out_cv(data = lacerta_small,
                                                group = group) 
    # confirm that we have the right balance of presence and background points
    check_splits_balance(lacerta_cv, class)
    #> # A tibble: 3 × 4
    #>   presence_assessment background_assessment presence_analysis
    #>                 <int>                 <int>             <int>
    #> 1                   2                     4                 1
    #> 2                   2                     4                 1
    #> 3                   2                     4                 1
    #> # ℹ 1 more variable: background_analysis <int>
    autoplot(lacerta_cv)
    

    Created on 2025-02-28 with reprex v2.1.1
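The same grouping idea can be sketched for a dataset shaped like the one in the question (10 presences, 10000 background points). This is only an illustration under assumptions: `dive_sim` is a simulated stand-in with nothing but a `class` column, and the background points are assigned to groups at random, as in the reprex, rather than by spatial proximity. It uses base R only, so it runs without tidysdm or sf.

```r
# Hypothetical stand-in for the asker's data: 10 presences and 10000
# background points. A real sf object would also carry geometry and
# predictor columns; only the class column matters for the grouping.
set.seed(123)
n_pres <- 10
n_bkg <- 10000
dive_sim <- data.frame(
  class = factor(rep(c("presence", "background"), times = c(n_pres, n_bkg)),
                 levels = c("presence", "background"))
)

# One group per presence
dive_sim$group <- NA_integer_
dive_sim$group[dive_sim$class == "presence"] <- seq_len(n_pres)

# Spread the background points evenly over the groups, in random order.
# rep_len() also copes with counts that are not a multiple of n_pres.
dive_sim$group[dive_sim$class == "background"] <-
  sample(rep_len(seq_len(n_pres), n_bkg))

# Each group should hold 1 presence and n_bkg / n_pres background points
group_counts <- table(dive_sim$group, dive_sim$class)
print(group_counts)

# With the real sf object (assumed to have the same class column and a
# group column built as above), the folds would then be created as in
# the reprex:
# dive.cv <- spatial_leave_location_out_cv(data = dive.vars1, group = group)
```

With 10 groups this gives 10 folds, each holding out one presence together with its share of the background points.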
