I have a combined dataset from 3 sites and would like to know how a universal relationship compares to site-specific relationships. The plan is a k-fold cross-validation. Based on this Cross Validated question, I need to sample proportionally from my different sites since they contain different numbers of observations. I've done k-fold CV with caret before:
library(dplyr)
library(caret)
dF = data_frame(y = runif(100, 1, 6), x1 = runif(100), x2 = runif(100), site = c(rep('a', 20), rep('b', 20), rep('c', 60))) %>% group_by(site)
train_control <- trainControl(method = "repeatedcv", number = 4, repeats = 3, savePredictions = TRUE)
model <- train(y ~ x1*x2 + I(x2^2), data = dF, trControl = train_control, method = 'glmStepAIC', family = gaussian(link = 'log')) # no need to preprocess because x1 and x2 both have a theoretical range of (0, 1].
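As a side note, since savePredictions = TRUE stores the held-out predictions in model$pred, a cross-validated mean absolute error can be computed by hand (caret itself reports RMSE and Rsquared in model$results):
with(model$pred, mean(abs(pred - obs))) # cross-validated MAE over all resamples
model$results # RMSE and Rsquared from the resampling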
But I haven't figured out how one might alter the partitioning so that the site with more observations doesn't unfairly influence the model skill.
The end result I'd like is a dataframe of R2 and mean absolute error for sites a, b and c and for all the data together. Similarly, I'd like to know the parameters for x1 and x2 in each of the model scenarios.
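One direction I've been sketching (untested beyond the toy data above; balanced_cv_control is my own helper name) is to build resampling indices that draw the same number of rows from every site and hand them to trainControl through its index/indexOut arguments:
balanced_cv_control = function(dF, k = 4) {
  n_min = min(table(dF$site)) # size of the smallest site
  # sample n_min row numbers from each site so every site contributes equally
  idx = unlist(lapply(split(seq_len(nrow(dF)), dF$site),
                      function(rows) sample(rows, n_min)))
  # k folds over just the balanced rows; returnTrain gives the fitting indices
  folds = createFolds(seq_along(idx), k = k, list = TRUE, returnTrain = TRUE)
  trainControl(method = "cv",
               index = lapply(folds, function(f) idx[f]),    # rows used for fitting
               indexOut = lapply(folds, function(f) idx[-f]), # balanced held-out rows
               savePredictions = TRUE)
}
model_bal = train(y ~ x1*x2 + I(x2^2), data = dF, trControl = balanced_cv_control(dF), method = 'glmStepAIC', family = gaussian(link = 'log'))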
EDIT
I found downSample in the caret documentation, which I think is supposed to help with this, but I keep getting an error. Does anyone know why this is happening? OSX 10.11.1, R 3.2.2, caret_6.0-58
down_train <- downSample(x = dplyr::select(datadF, -site), y = as.factor(datadF$site))
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
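My best guess so far, and it is only a guess, is that downSample doesn't like the grouped tbl_df, so a workaround sketch would be to ungroup and coerce to a plain data.frame first:
plainDF <- as.data.frame(dplyr::ungroup(datadF)) # assumption: downSample wants an ungrouped plain data.frame
down_train <- downSample(x = dplyr::select(plainDF, -site), y = as.factor(plainDF$site))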
I ended up writing these functions to do what I need:
The first function, partition_data, figures out how many observations there are in each site, takes the minimum across all the sites, and multiplies it by frac to get the number of samples to draw from each site. The second helper function, site_partition, essentially just calls createDataPartition from the caret package for each site (using split/apply/combine from dplyr), where perc is the fraction of that particular site's observations that should be partitioned for fitting.
partition_data = function(dF, frac) {
  # number of rows to draw per site: frac times the size of the smallest site
  numobs = dF %>% group_by(site) %>% summarise(nrw = n()) %>% summarise(frac * min(nrw)) %>% as.numeric
  print(numobs)
  # draw the same number of rows from every site
  testdata = dF %>% group_by(site) %>%
    do(site_partition(., numobs))
}

site_partition = function(dF, numobs) {
  # fraction of this site's rows needed to yield numobs samples
  perc = numobs / nrow(dF)
  print(paste(unique(dF$site), ': perc =', perc))
  ind = createDataPartition(dF$y,
                            p = perc,
                            list = FALSE,
                            times = 1)
  return(dF[ind, ])
}
datadF = data_frame(y = runif(100, 1, 6), x1 = runif(100), x2 = runif(100), site = c(rep('a', 20), rep('b', 20), rep('c', 60))) %>% group_by(site)
testdata = partition_data(datadF, 0.6) # fit using this data
valdata = anti_join(datadF, testdata) # independent validation with this data
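And to get the end result described at the top, here is a sketch (fit_and_score is my own helper, untested beyond this toy data) that fits the universal model plus one model per site on testdata, scores each against the matching rows of valdata, and prints each scenario's coefficients:
fit_and_score = function(train_df, val_df, label) {
  ctrl = trainControl(method = "repeatedcv", number = 4, repeats = 3)
  m = train(y ~ x1*x2 + I(x2^2), data = as.data.frame(train_df),
            trControl = ctrl, method = 'glmStepAIC', family = gaussian(link = 'log'))
  pred = predict(m, newdata = val_df)
  print(coef(m$finalModel)) # fitted parameters for this scenario
  data.frame(model = label, r2 = cor(pred, val_df$y)^2, mae = mean(abs(pred - val_df$y)))
}
scores = rbind(fit_and_score(testdata, valdata, 'universal'),
               do.call(rbind, lapply(c('a', 'b', 'c'), function(s)
                 fit_and_score(dplyr::filter(testdata, site == s),
                               dplyr::filter(valdata, site == s), s))))
scores # one row of R2 and MAE per scenario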