Tags: r, random-forest, mlr

Is there an R function to combine the results of 2 training data sets?


I have a 2.2 million row dataset. randomForest throws an error if the training data set has more than 1,000,000 rows. So I split the data set into two pieces and train the models separately. How do I combine() the models so I can make a prediction with both of their knowledge?

rtask <- makeClassifTask(data=Originaldaten,target="geklaut")
set.seed(1)

ho = makeResampleInstance("CV",task=rtask, iters = 20)
rtask.train = subsetTask(rtask, subset = 1:1000000)
rtask.train2 = subsetTask(rtask, subset = 1000001:2000000)
rtask.test = subsetTask(rtask, subset = 2000001:2227502)

rlearn_lm <- makeWeightedClassesWrapper(makeLearner("classif.randomForest"), wcw.weight = 0.1209123724417812)

param_lm <- makeParamSet(
  makeIntegerParam("ntree", lower = 500, upper = 500),   # fixed via a degenerate range
  makeDiscreteParam("norm.votes", values = FALSE),
  makeDiscreteParam("importance", values = TRUE),
  makeIntegerParam("maxnodes", lower = 4, upper = 4)
)

tune_lm <- tuneParams(rlearn_lm,
                  rtask.train,
                  cv5,  # 5-fold cross-validation
                  mmce, # mean misclassification error
                  param_lm,
                  makeTuneControlGrid(resolution=5)) # grid over the value ranges

rlearn_lm <- setHyperPars(rlearn_lm,par.vals = tune_lm$x)

model_lm <- train(rlearn_lm,rtask.train)
model_lm2 <- train(rlearn_lm,rtask.train2)
modelGesamt <- combine(model_lm, model_lm2)

EDIT

You guys are right; actually reading my own code helped me a lot. Here is a working resampling setup for anyone interested in the future:

ho = makeResampleInstance("CV",task=rtask, iters = 20)  
rtask.train = subsetTask(rtask,ho$train.inds[[1]])
rtask.test = subsetTask(rtask,ho$test.inds[[1]] )
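For completeness, a minimal sketch of how the holdout split above can be used end to end; it reuses `rtask.train`, `rtask.test`, and the tuned `rlearn_lm` from the code above, so it is only runnable in that context:

```r
# Hedged sketch: train on the resampled training task,
# then evaluate on the held-out task.
model <- train(rlearn_lm, rtask.train)
pred  <- predict(model, rtask.test)
performance(pred, measures = mmce)  # mean misclassification error
```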

Solution

  • This is not possible and you should also not do it. Train one model, even if it takes longer.

    Models can't be merged to fuse their knowledge if they were trained on different datasets.
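    Following that advice, a hedged sketch of the recommended route: train one model on the full task rather than combining two partial models. The names `rtask`, `rlearn_lm`, and `rtask.test` are the objects defined in the question and its EDIT.

    ```r
    # One model over the full 2.2M-row task (may take long / need memory).
    model_full <- train(rlearn_lm, rtask)
    pred       <- predict(model_full, rtask.test)
    performance(pred, measures = mmce)
    ```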