I have a data set from which I would like to draw a stratified sample, build statistical models using the caret package, and then generate predictions.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
What I want to be able to do is:
- Generate the stratified data sample
- Create the machine learning model
- Repeat 1000 times & take the average model output
I hope that by repeating the steps on the variations of the stratified data set, I will be able to avoid the subtle changes in the predictions generated due to a smaller data sample.
For example, it may look something like this in R:
Original.Dataset = data.frame(A)
Stratified.Dataset = stratified(Original.Dataset, group = x)
Model = train(Stratified.Dataset.....other model inputs)
Repeat process with new stratified data set based on the original data and average out.
Thank you in advance for any help, or package suggestions that might be useful. Is it possible to stratify the sample in caret or simulate in caret?
First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad. If you need input on statistics, I would suggest asking a more clearly defined question on Cross Validated, the Q&A site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. This depends on how large your different groups are. E.g. if you are trying to divide your data set of 1000 samples into groups of 10 samples, your model could very well be unstable and hence give different results in each iteration. This could also be because your model depends on some randomness, and the smaller your data is (and the more groups you have), the larger the variation will be. See here or here for more information on cross validation, stability and bootstrap aggregating.
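To see why, here is a tiny toy illustration (assumed setup, not your data): refitting even a trivial estimate on small subsets drawn with different seeds already gives noticeably different answers.

set.seed(1)
pop <- rnorm(1000, mean = 5)      # stand-in for a data set of M = 1000

estimates <- sapply(1:20, function(s) {
  set.seed(s)
  mean(sample(pop, 10))           # each "iteration" uses a different 10-sample subset
})

range(estimates)                  # the spread shows how much a small subset moves the result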
- Generate the stratified data sample
How to generate it: the dplyr package is excellent for grouping data by different variables. You might also want to use the split function found in the base package; see here for more information. You could also use the built-in methods found in the caret package, found here.
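As a minimal sketch of the caret route (the data frame df and grouping column x below are hypothetical, standing in for your Original.Dataset and group = x):

library(caret)

# hypothetical data: a 70/30 grouping variable x and one measurement
set.seed(1)
df <- data.frame(x     = factor(rep(c("a", "b"), times = c(700, 300))),
                 value = rnorm(1000))

# createDataPartition() samples within each level of df$x, so the
# group proportions are preserved in the training set
train_idx <- createDataPartition(df$x, p = 0.8, list = FALSE)
training  <- df[train_idx, ]
holdout   <- df[-train_idx, ]

table(training$x) / nrow(training)   # roughly the same 70/30 proportions as in df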
How to know how to split it: it very much depends on the question you would like to answer; most likely you would like to balance some variables, e.g. gender and age, when creating a model for predicting disease. See here for more info.
In the case of having e.g. duplicated observations, where you want to create unique subsets with different combinations of replicates and their unique measurements, you would have to use other methods. If the replicates have a common identifier, here sample_names, you could do something like this to select all samples but with different combinations of the replicates:
# toy data: 5 samples, each measured twice (replicates share a sample_names id)
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)

# build 100 partitions; each picks one randomly chosen replicate per sample
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})

# the first partition of your data to train a model
tg[partition[[1]], ]
- Create the machine learning model
If you want to use caret, you could go to the caret webpage and see all the available models. Depending on your research question and/or data you would like to use different types of models. Therefore, I would recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search for the available models.
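As a minimal sketch of a train() call, reusing the hypothetical training and holdout data frames from the split above (the formula and method = "glm" are placeholders for whatever model suits your question):

library(caret)

# method can be any model name from the caret model list
fit <- train(x ~ value, data = training, method = "glm")

fit                                     # resampled performance estimates
head(predict(fit, newdata = holdout))   # predictions on the held-out data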
- Repeat 1000 times & take the average model output
You can repeat your model 1000 times with different seeds (see set.seed) and use different training methods, e.g. cross validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:
The function trainControl generates parameters that further control how models are created, with possible values:
method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
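For example, repeated cross-validation lets caret do the repetition and averaging for you (again using the hypothetical training data from above; the number of folds and repeats are arbitrary):

library(caret)

ctrl <- trainControl(method  = "repeatedcv",   # repeated k-fold cross-validation
                     number  = 10,             # 10 folds
                     repeats = 100)            # repeat the whole CV 100 times

set.seed(42)
fit <- train(x ~ value, data = training, method = "glm", trControl = ctrl)

fit$results   # performance averaged over all 100 x 10 resamples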