Tags: scikit-learn, random-forest, decision-tree, boosting

Tree-based algorithms: different behavior with duplicated features


I don't understand why I get three different behaviors depending on the classifier I use, even though I would expect them to go hand in hand.

Here is the code to dig into the question:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier 
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np

# load the wine data

wine = datasets.load_wine()
X = wine.data
y = wine.target

# some helper functions

def repeat_feature(X, which=1, times=1):
    # append `times` copies of the first `which` columns of X
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    # mean 5-fold cross-validation test score of clf on (X, y)
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])

# define the classifiers

clf1 = DecisionTreeClassifier(max_depth=25, random_state=42)
clf2 = RandomForestClassifier(n_estimators=5, random_state=42)
clf3 = LGBMClassifier(n_estimators=5, random_state=42)


# repeat the first feature from 1 to 49 times and test the classifiers

clf1_result=[]
clf2_result=[]
clf3_result=[]

for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))
    
    
# plot the mean of the cv-scores for each classifier    
    
plt.figure(figsize=(12, 7))
plt.plot(clf1_result, label='tree')
plt.plot(clf2_result, label='forest')
plt.plot(clf3_result, label='boost')
plt.legend()
plt.show()

The result of the previous script is the following graph (mean CV score vs. number of repeated copies of the feature, one line per classifier):

[plot: the 'tree' and 'boost' curves stay roughly flat while the 'forest' curve drops]

What I wanted to verify is whether adding redundant information (a repeated feature) causes the score to decrease, which happens as expected for the random forest.

The question is: why does this not happen with the other two classifiers? Why do their scores remain stable?

Am I missing something from the theoretical point of view?

Thanks, all.


Solution

  • When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier) or a LightGBM model with its default settings (lightgbm.LGBMClassifier), the training algorithm considers every feature as a candidate at every split and always chooses the split with the best "gain" (the largest reduction in the training loss).

    Because of this, adding identical copies of a feature does not change the fit to the training data: the copies offer exactly the same candidate splits, so the chosen splits, and therefore the fitted model, stay the same (see the first sketch below).

    For a random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The forest explains the training data by ensembling multiple slightly-different trees, and this is effective because the different trees capture different characteristics of the target. If you hold the number of trees and the number of leaves per tree constant, adding copies of a feature means the per-split candidate subsets are increasingly dominated by those copies, which reduces the diversity of the trees and therefore the forest's fit (see the second sketch below).
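
A minimal sketch of the first point, reusing the wine data from the question: the duplicated columns give a single decision tree exactly the same candidate splits, so the cross-validated score should be (essentially) identical. The 10 copies below are an arbitrary choice.

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

wine = datasets.load_wine()
X, y = wine.data, wine.target

# append 10 extra copies of the first feature
X_dup = np.hstack([X] + [X[:, :1]] * 10)

clf = DecisionTreeClassifier(max_depth=25, random_state=42)

score_plain = np.mean(cross_validate(clf, X, y, cv=5)['test_score'])
score_dup = np.mean(cross_validate(clf, X_dup, y, cv=5)['test_score'])

# the copies only create ties between identical splits, so the fitted
# partitions -- and hence the scores -- are expected to match
print(score_plain, score_dup)

LightGBM behaves like the single tree here because, by default, it also considers every feature at every split (its feature_fraction / colsample_bytree parameter is 1.0); lowering that parameter should reproduce a random-forest-style degradation.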
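
And a sketch of the random-forest point, reusing repeat_feature, do_the_job, X and y from the question's script: with the default per-split feature subsampling, the copies increasingly crowd out the other features, while a forest forced to consider every feature at each split (max_features=None) should behave like the single tree.

from sklearn.ensemble import RandomForestClassifier

# default per-split feature subsampling vs. all features at every split
rf_subset = RandomForestClassifier(n_estimators=5, random_state=42)
rf_all = RandomForestClassifier(n_estimators=5, max_features=None, random_state=42)

for times in (1, 10, 40):
    X_rep = repeat_feature(X, times=times)
    # rf_subset is expected to degrade as the copies pile up,
    # while rf_all should stay roughly flat, like the single tree
    print(times, do_the_job(X_rep, y, rf_subset), do_the_job(X_rep, y, rf_all))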