Tags: r, machine-learning, statistics, ensemble-learning

R - How to create a stacker ensemble?


I need to create a stacking ensemble. Do I combine the summarised accuracy output from each classifier with a new classifier, i.e. take

Naive Bayes: accuracy = 0.61
k-NN (k = 5): accuracy = 0.63
k-NN (k = 10): accuracy = 0.64
Decision tree: accuracy = 0.60
Logistic regression: accuracy = 0.62

and classify those five accuracy values?

Or do I need to combine the outputs of many individual predictions, e.g. something like this table:

NB    k-NN (k=5)  k-NN (k=10)  DecTree  Logistic  TrueLabel
bob   1           1            bob      FALSE     bob
bob   2           2            john     TRUE      john
bob   1           1            bob      TRUE      bob

If it is done this way, does it matter that the outputs are encoded differently, i.e. should they all be either bob or john instead of TRUE/FALSE or 1/2?

What classifier should I use to combine them?


Solution

  • In order to create a stacking ensemble you need to use the table you created at the end of your question, i.e. this one:

    NB    k-NN (k=5)  k-NN (k=10)  DecTree  Logistic  TrueLabel
    bob   1           1            bob      FALSE     bob
    bob   2           2            john     TRUE      john
    bob   1           1            bob      TRUE      bob
    

    The answer to whether they should all be bob/john instead of TRUE/FALSE or 1/2 is that it depends on the model you will use to combine the individual models. Most models in R work with factors, in which case leaving them as they are is fine. Make sure the columns that hold numeric values (the two k-NN columns) are also converted to factors, otherwise they will be treated as numbers, which you don't want: many models create dummy variables from a factor, and that won't happen for a numeric column. In short, use factors for all of the above columns, but read the documentation of the combination model (more on this below) to confirm that it accepts factors as input.
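    As a minimal sketch of what that preparation might look like in R (the data frame name meta_df and its column names are made up for illustration, using the three example rows above):

    # Hypothetical collection of the individual models' predictions plus the true label
    meta_df <- data.frame(
      NB        = c("bob", "bob", "bob"),
      kNN5      = c(1, 2, 1),
      kNN10     = c(1, 2, 1),
      dectree   = c("bob", "john", "bob"),
      Logistic  = c(FALSE, TRUE, TRUE),
      TrueLabel = c("bob", "john", "bob")
    )

    # Convert every column (including the numeric k-NN ones) to a factor,
    # so the combining model treats them as categories rather than numbers
    meta_df[] <- lapply(meta_df, as.factor)
    str(meta_df)  # all columns should now be listed as factors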

    For the other question, which model to use to combine the inputs, the answer is: any model you like. The usual practice is a simple logistic regression, but nothing stops you from choosing something else. The idea is to use your original variables (the ones you used to train the individual models) plus the table you created above (i.e. the individual models' predictions) as features for the combining model, and then check whether the combined accuracy is better than the individual accuracies. In the combined model you can still apply feature-selection techniques such as forward or backward selection to remove insignificant variables.
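    A rough sketch of such a combination with logistic regression, assuming hypothetical data frames train_meta and test_meta that hold the original predictors, the individual models' predictions (as factors) and a two-level TrueLabel factor. Note that the predictions used to train the combiner are normally produced out-of-fold (or on a held-out set), so the meta-model is not fitted on predictions the base models made for their own training rows:

    # Fit a logistic regression as the combining (meta) model on all available features
    stacker <- glm(TrueLabel ~ ., data = train_meta, family = binomial)

    # Optional: backward selection to drop insignificant variables
    stacker <- step(stacker, direction = "backward", trace = 0)

    # Predict on new data with the same columns (minus TrueLabel)
    prob <- predict(stacker, newdata = test_meta, type = "response")
    pred <- ifelse(prob > 0.5, "john", "bob")  # high probability corresponds to the second factor level ("john")

    # Accuracy of the combined model
    mean(pred == test_meta$TrueLabel)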

    I hope this answers your questions.