AdaBoostClassifier and the 'SAMME.R' Algorithm


It takes a while to get to the actual question, so please bear with me. The AdaBoost documentation states that it "is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted". To do that, one of the parameters it takes is base_estimator. For the base_estimator to be usable with AdaBoostClassifier, "support for sample weighting is required".

So my first issue was - which classifiers provide support for sample weighting? I did some research, and, fortunately, someone smarter than me had the answer. Somewhat updated, it works thus: by running

from sklearn.utils import all_estimators  # sklearn.utils.testing before scikit-learn 0.22

print(all_estimators(type_filter='classifier'))

you get a list of all classifiers (turns out there are 31 of them!). Then, if you run

import inspect

for name, clf in all_estimators(type_filter='classifier'):
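    # Inspect the class's fit signature; instantiating with clf() fails for estimators that need constructor arguments.
    if 'sample_weight' in inspect.signature(clf.fit).parameters: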
        print(name)

you can get the list of all classifiers which provide support for sample weighting (21 of them, for the curious).
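
To make "support for sample weighting" concrete, here is a minimal sketch of what passing sample_weight to fit does. The toy data is my own assumption, not something from the documentation: all three samples share the same feature value, so the tree cannot split and simply predicts the class with the largest total weight.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): identical features, so the
# tree is a single leaf that predicts the class with the most total weight.
X = [[0], [0], [0]]
y = [0, 0, 1]

unweighted = DecisionTreeClassifier().fit(X, y)

# Up-weight the lone class-1 sample, roughly as a boosting round would do
# for an instance the previous learner misclassified.
weighted = DecisionTreeClassifier().fit(X, y, sample_weight=[1, 1, 5])

pred_unweighted = int(unweighted.predict([[0]])[0])
pred_weighted = int(weighted.predict([[0]])[0])
print(pred_unweighted, pred_weighted)
```

A classifier whose fit method has no sample_weight argument cannot be re-fitted on re-weighted data this way, which is why AdaBoostClassifier needs it.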

So far so good. But now we have to deal with another AdaBoostClassifier parameter, namely algorithm. You have two options: {'SAMME', 'SAMME.R'}, optional (default='SAMME.R'). We're told that to "use the SAMME.R real boosting algorithm base_estimator must support calculation of class probabilities". And this is where I got stuck. Searching online, I can only find two classifiers used with 'SAMME.R' as an argument for algorithm: DecisionTreeClassifier (which is the default) and RandomForestClassifier.

So here's the question: which other classifiers from the 21 that are compatible with AdaBoostClassifier offer support for the calculation of class probabilities?

Thanks.


Solution

  • I am pretty sure that when the documentation says the base estimator "must support calculation of class probabilities", it means that the estimator has a predict_proba method.

    This is the method that many classifiers use to return the probabilities for each class given an observation. With that understanding you just need to check for classifiers that have the predict_proba method:

    from sklearn.utils import all_estimators  # sklearn.utils.testing before scikit-learn 0.22

    for name, clf in all_estimators(type_filter='classifier'):
        if hasattr(clf, 'predict_proba'):
            print(clf, name)
    
    <class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> AdaBoostClassifier
    <class 'sklearn.ensemble.bagging.BaggingClassifier'> BaggingClassifier
    <class 'sklearn.naive_bayes.BernoulliNB'> BernoulliNB
    <class 'sklearn.calibration.CalibratedClassifierCV'> CalibratedClassifierCV
    <class 'sklearn.naive_bayes.ComplementNB'> ComplementNB
    <class 'sklearn.tree.tree.DecisionTreeClassifier'> DecisionTreeClassifier
    <class 'sklearn.tree.tree.ExtraTreeClassifier'> ExtraTreeClassifier
    <class 'sklearn.ensemble.forest.ExtraTreesClassifier'> ExtraTreesClassifier
    <class 'sklearn.naive_bayes.GaussianNB'> GaussianNB
    <class 'sklearn.gaussian_process.gpc.GaussianProcessClassifier'> GaussianProcessClassifier
    <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> GradientBoostingClassifier
    <class 'sklearn.neighbors.classification.KNeighborsClassifier'> KNeighborsClassifier
    <class 'sklearn.semi_supervised.label_propagation.LabelPropagation'> LabelPropagation
    <class 'sklearn.semi_supervised.label_propagation.LabelSpreading'> LabelSpreading
    <class 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'> LinearDiscriminantAnalysis
    <class 'sklearn.linear_model.logistic.LogisticRegression'> LogisticRegression
    <class 'sklearn.linear_model.logistic.LogisticRegressionCV'> LogisticRegressionCV
    <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'> MLPClassifier
    <class 'sklearn.naive_bayes.MultinomialNB'> MultinomialNB
    <class 'sklearn.svm.classes.NuSVC'> NuSVC
    <class 'sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis'> QuadraticDiscriminantAnalysis
    <class 'sklearn.ensemble.forest.RandomForestClassifier'> RandomForestClassifier
    <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> SGDClassifier
    <class 'sklearn.svm.classes.SVC'> SVC
    

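    So 24 of the 31 classifiers expose predict_proba. To be a potential base_estimator for AdaBoostClassifier with 'SAMME.R', a classifier must also be one of your 21 that accept sample_weight in fit, so the usable set is the intersection of the two lists. KNeighborsClassifier, for example, has predict_proba, but its fit method takes no sample_weight.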

    The error returned from using an improper classifier as base_estimator is also quite helpful in this regard.

    TypeError: AdaBoostClassifier with algorithm='SAMME.R' requires that the weak learner supports the calculation of class probabilities with a predict_proba method. Please change the base estimator or set algorithm='SAMME' instead.

    As you can see, the error specifically points you towards classifiers with a predict_proba method.
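
One caveat: hasattr on the class can overcount, because some classifiers only expose predict_proba for certain constructor settings. As far as I know, SVC needs probability=True and SGDClassifier needs a probabilistic loss such as 'modified_huber'. A rough sketch of checking configured instances instead of classes:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Check instances rather than classes: predict_proba is only available
# when the estimator is configured to produce probabilities.
checks = {
    "SVC()": hasattr(SVC(), 'predict_proba'),
    "SVC(probability=True)": hasattr(SVC(probability=True), 'predict_proba'),
    "SGDClassifier()": hasattr(SGDClassifier(), 'predict_proba'),
    "SGDClassifier(loss='modified_huber')": hasattr(
        SGDClassifier(loss='modified_huber'), 'predict_proba'
    ),
}

for label, has_proba in checks.items():
    print(label, has_proba)
```

So before passing one of these as the base estimator with 'SAMME.R', it is worth running the hasattr check on the exact instance you intend to use.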