I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
This is the code in order to go deeply in the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X,which=1,times=1):
return np.hstack([X,np.hstack([X[:, :which]]*times)])
def do_the_job(X,y,clf):
return np.mean(cross_validate(clf, X, y,cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1,50):
my_x=repeat_feature(X,times=i)
clf1_result.append(do_the_job(my_x,y,clf1))
clf2_result.append(do_the_job(my_x,y,clf2))
clf3_result.append(do_the_job(my_x,y,clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is why does this not happen with the other two classifiers instead? Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Ty all
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier
)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier
), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.