I have a data set from which I would like to draw a stratified sample, build statistical models using the caret package, and then generate predictions.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
What I want to be able to do is:
- Generate the stratified data sample
- Create the machine learning model
- Repeat 1000 times & take the average model output
I hope that by repeating the steps on the variations of the stratified data set, I will be able to avoid the subtle changes in the predictions generated due to a smaller data sample.
For example, it may look something like this in R:
Original.Dataset = data.frame(A)
Stratified.Dataset = stratified(Original.Dataset, group = x)
Model = train(Stratified.Dataset.....other model inputs)
Repeat process with new stratified data set based on the original data and average out.
Thank you in advance for any help, or package suggestions that might be useful. Is it possible to stratify the sample in caret or simulate in caret?
First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad. If you need input on statistics, I would suggest asking a more clearly defined question on Cross Validated, the Q&A site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. This depends on how large your different groups are. E.g. if you are trying to divide your data set of 1000 samples into groups of 10 samples, your model could very well be unstable and hence give different results in each iteration. This could also be because your model depends on some randomness, and the smaller your data is (and the more groups you have), the larger the variation will be. See here or here for more information on cross validation, stability and bootstrap aggregating.
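To see why, here is a tiny toy illustration (assumed setup, not your data): refitting even a trivial estimate on small subsets drawn with different seeds already gives noticeably different answers.

set.seed(1)
pop <- rnorm(1000, mean = 5)      # stand-in for a data set of M = 1000

estimates <- sapply(1:20, function(s) {
  set.seed(s)
  mean(sample(pop, 10))           # each "iteration" uses a different 10-sample subset
})

range(estimates)                  # the spread shows how much a small subset moves the result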
- Generate the stratified data sample
How to generate it: the dplyr package is excellent for grouping data by different variables. You might also want to use the split function found in the base package; see here for more information. You could also use the built-in methods found in the caret package, found here.
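As a minimal sketch of the caret route (the data frame df and grouping column x below are hypothetical, standing in for your Original.Dataset and group = x):

library(caret)

# hypothetical data: a 70/30 grouping variable x and one measurement
set.seed(1)
df <- data.frame(x     = factor(rep(c("a", "b"), times = c(700, 300))),
                 value = rnorm(1000))

# createDataPartition() samples within each level of df$x, so the
# group proportions are preserved in the training set
train_idx <- createDataPartition(df$x, p = 0.8, list = FALSE)
training  <- df[train_idx, ]
holdout   <- df[-train_idx, ]

table(training$x) / nrow(training)   # roughly the same 70/30 proportions as in df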
How to know how to split it: it very much depends on the question you would like to answer; most likely you would like to balance some variables, e.g. gender and age, when creating a model for predicting disease. See here for more info.
In the case of having e.g. duplicated observations, where you want to create unique subsets with different combinations of replicates and their unique measurements, you would have to use other methods. If the replicates have a common identifier, here sample_names, you could do something like this to select all samples but with different combinations of the replicates:
# toy data: 5 samples, each measured twice (replicates share a sample_names id)
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)

# build 100 partitions; each picks one randomly chosen replicate per sample
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})

# the first partition of your data to train a model
tg[partition[[1]], ]
- Create the machine learning model
If you want to use caret, you could go to the caret webpage and see all the available models. Depending on your research question and/or data you would like to use different types of models. Therefore, I would recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search for the available models.
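As a minimal sketch of a train() call, reusing the hypothetical training and holdout data frames from the split above (the formula and method = "glm" are placeholders for whatever model suits your question):

library(caret)

# method can be any model name from the caret model list
fit <- train(x ~ value, data = training, method = "glm")

fit                                     # resampled performance estimates
head(predict(fit, newdata = holdout))   # predictions on the held-out data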
- Repeat 1000 times & take the average model output
You can repeat your model 1000 times with different seeds (see set.seed) and use different training methods, e.g. cross validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:
The function trainControl generates parameters that further control how models are created, with possible values:
method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
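For example, repeated cross-validation lets caret do the repetition and averaging for you (again using the hypothetical training data from above; the number of folds and repeats are arbitrary):

library(caret)

ctrl <- trainControl(method  = "repeatedcv",   # repeated k-fold cross-validation
                     number  = 10,             # 10 folds
                     repeats = 100)            # repeat the whole CV 100 times

set.seed(42)
fit <- train(x ~ value, data = training, method = "glm", trControl = ctrl)

fit$results   # performance averaged over all 100 x 10 resamples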