k-fold cross validation with different sample sizes


I have a combined dataset from 3 sites and would like to know how a universal relationship compares to site-specific relationships. The plan is a k-fold cross-validation. Based on this Cross Validated question, I need to sample proportionally from my sites because they have different numbers of observations. I've done k-fold CV with caret before:

library(dplyr)
library(caret)
# Simulated data: 3 sites with unequal numbers of observations (20, 20, 60)
dF <- tibble(y = runif(100, 1, 6), x1 = runif(100), x2 = runif(100),
             site = c(rep('a', 20), rep('b', 20), rep('c', 60))) %>% group_by(site)
# 4-fold cross-validation, repeated 3 times
train_control <- trainControl(method = "repeatedcv", number = 4, repeats = 3, savePredictions = TRUE)
# No need to preprocess because x1 and x2 both have theoretical values in (0,1].
# method = 'glmStepAIC' needs the MASS package installed.
model <- train(y ~ x1*x2 + I(x2^2), data = dF, trControl = train_control, method = 'glmStepAIC', family = gaussian(link = 'log'))

but now I haven't figured out how to alter the partitioning so that the site with more observations doesn't unfairly influence the model skill.
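
One way to read "proportional" sampling is to stratify the folds by site, so that every fold holds about 1/k of each site. The following is a minimal base-R sketch of that idea only; the names `fold` and `train_idx` are made up for illustration, and stratifying keeps the site proportions in each fold rather than equalising each site's influence.

```r
set.seed(7)
# Site labels with the same unequal sizes as the example data (20, 20, 60)
site <- rep(c("a", "b", "c"), times = c(20, 20, 60))
k <- 4

# Within each site, hand out fold ids 1..k in shuffled order,
# so every fold receives about 1/k of that site's rows
fold <- ave(seq_along(site), site,
            FUN = function(i) sample(rep_len(seq_len(k), length(i))))

# Rows per site in each fold
table(site, fold)

# Training-row indices for each fold (all rows outside the held-out fold)
train_idx <- lapply(seq_len(k), function(f) which(fold != f))
names(train_idx) <- paste0("Fold", seq_len(k))
```

caret's trainControl has an `index` argument that takes a list of training-row indices like `train_idx`, which should let custom folds of this kind be passed to train.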

So the end result I'd like is a data frame of R² and mean absolute error for sites a, b and c and for all the data together. Similarly, I'd like to know the parameters for x1 and x2 in each of the model scenarios.
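
To make the target output concrete, here is a rough base-R sketch of such a table. It is only an illustration under assumptions: `skill` and `fit_one` are hypothetical helpers, the data are simulated, R² is taken as the squared correlation between observed and fitted values, and the metrics are in-sample. For cross-validated skill, the fitted values would be replaced by held-out predictions.

```r
set.seed(1)
# Simulated stand-in with the same shape as the example data
dF <- data.frame(y = runif(100, 1, 6), x1 = runif(100), x2 = runif(100),
                 site = rep(c("a", "b", "c"), times = c(20, 20, 60)))

# Hypothetical helper: R2 (squared correlation) and mean absolute error
skill <- function(obs, pred) {
  c(r2 = cor(obs, pred)^2, mae = mean(abs(obs - pred)))
}

# Hypothetical helper: the same model formula and family as in the caret call
fit_one <- function(d) {
  glm(y ~ x1 * x2 + I(x2^2), data = d, family = gaussian(link = "log"))
}

# One subset per site, plus the pooled data under the name "all"
subsets <- c(split(dF, dF$site), list(all = dF))
fits <- lapply(subsets, fit_one)

# One row per scenario: skill metrics and fitted coefficients
skill_table <- do.call(rbind, Map(function(m, d) skill(d$y, fitted(m)), fits, subsets))
coef_table <- do.call(rbind, lapply(fits, coef))
```

Both results are matrices with one row per scenario (a, b, c, all); `as.data.frame(skill_table)` turns the first into a data frame.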

EDIT: I found downSample in the caret documentation, which I think is supposed to help with this, but I keep getting an error. Does anyone know why this is happening? OSX 10.11.1, R 3.2.2, caret_6.0-58

down_train <- downSample(x = dplyr::select(datadF,-basin), y = as.factor(datadF$basin))
Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
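
The message itself does not show which group was being oversampled. As a cross-check that does not rely on downSample, here is a minimal base-R sketch of the same idea: draw as many rows from every basin as the smallest basin has. The simulated `datadF` below is a stand-in (an assumption, not the real data), and this is not caret's implementation.

```r
set.seed(42)
# Hypothetical stand-in for datadF, with a `basin` grouping column
datadF <- data.frame(snowdepth = runif(100, 1, 6), x1 = runif(100), x2 = runif(100),
                     basin = rep(c("a", "b", "c"), times = c(20, 20, 60)))

# Size of the smallest basin
n_min <- min(table(datadF$basin))

# Draw n_min row indices from each basin, without replacement
rows <- unlist(lapply(split(seq_len(nrow(datadF)), datadF$basin),
                      function(idx) idx[sample.int(length(idx), n_min)]))

down_train <- datadF[rows, ]
table(down_train$basin)
```

Indexing with `idx[sample.int(length(idx), n_min)]` rather than `sample(idx, n_min)` avoids the surprise where sample() treats a single number as a range to draw from.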

Solution

  • I ended up writing these functions to do what I need:

    The function partition_data counts the observations in each basin, takes the minimum across basins, and multiplies it by frac to get the number of samples to draw from each basin.

    The second helper function, site_partition, essentially just calls createDataPartition from the caret package for each basin (using split/apply/combine from dplyr), where perc is the fraction of that basin's observations to partition off for fitting.

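    library(dplyr)
    library(caret)

    # Draw the same number of rows (frac * size of the smallest basin) from every basin.
    partition_data = function(dF, frac) {
        numobs = dF %>% group_by(basin) %>% summarise(nrw = n()) %>% summarise(frac*min(nrw)) %>% as.numeric
        print(numobs)
        testdata = dF %>% group_by(basin) %>%
            do(site_partition(., numobs))
        return(testdata)
    }

    # Sample roughly numobs rows from a single basin, stratified on snowdepth.
    site_partition = function(dF, numobs) {
        perc = numobs/nrow(dF)
        print(paste(unique(dF$basin), ': perc =', perc))
        ind = createDataPartition(dF$snowdepth,
                                p = perc,
                                list = FALSE,
                                times = 1)
        return(dF[as.vector(ind), ])  # drop the matrix shape before row indexing
    }

    # Example data; the columns are named snowdepth and basin to match the functions above.
    datadF = tibble(snowdepth = runif(100,1,6), x1 = runif(100), x2 = runif(100), basin = c(rep('a',20), rep('b',20), rep('c',60))) %>% group_by(basin)
    testdata = partition_data(datadF, 0.6) # fit using this data
    valdata = anti_join(datadF, testdata)  # independent validation with this data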