Tags: scikit-learn, data-science, xgboost, kaggle

Python - RandomForestClassifier and XGBClassifier have the exact same score


Question: Could you help me understand why RandomForestClassifier and XGBClassifier give the exact same score?

Context: I'm working on the Kaggle Titanic problem, and on my first attempt I want to compare some common models.

Code:

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from xgboost import XGBClassifier

# X_train, X_valid, y_train, y_valid come from a train/validation split
# of the Titanic data (not shown in the question).

pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
preprocessor = make_column_transformer(
    (pipeline, ['Embarked']),
    (OneHotEncoder(), ['Sex']),
    #(OrdinalEncoder(), ['Cabin'])
)

models = [
    RandomForestClassifier(n_estimators=1, random_state=42),
    XGBClassifier(random_state=42, n_estimators=100, max_depth=42),
    SGDClassifier()
]

my_pipelines = []
for model in models:
    my_pipelines.append(Pipeline(steps=[('preprocessor', preprocessor),
                                        ('model', model)]))

for idx, pipeline in enumerate(my_pipelines):
    pipeline.fit(X_train, y_train)
    pred = pipeline.predict(X_valid)
    print(accuracy_score(y_valid, pred))

Output:

0.770949720670391
0.770949720670391
0.6312849162011173

Thank you very much for your help!


Solution

  • It is true that both algorithms are tree based. However, notice that you set n_estimators=1 for the RandomForestClassifier, so it is effectively a single DecisionTreeClassifier, while the gradient-boosting model is a genuine ensemble of 100 trees. One would normally expect different results.
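
    To see the single-tree equivalence concretely, here is a small sketch on synthetic data (not from the original post). Strictly speaking, a forest also bootstraps rows and subsamples features by default, so the sketch disables both to make the comparison exact:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # A one-tree forest with bootstrapping and feature subsampling disabled...
    forest = RandomForestClassifier(n_estimators=1, bootstrap=False,
                                    max_features=None, random_state=0)
    # ...behaves like a plain decision tree.
    tree = DecisionTreeClassifier(random_state=0)

    forest.fit(X, y)
    tree.fit(X, y)

    # Fraction of identical predictions; usually 1.0 (random tie-breaking
    # in the split search can occasionally differ).
    print((forest.predict(X) == tree.predict(X)).mean())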

    Thus, the only thing that makes the performances equal is your data. You have only 2 features, both of which are categorical, so with these data you cannot learn a complex model: the trees end up identical. You can check the number of nodes in the tree (e.g. my_pipelines[0][-1].estimators_[0].tree_.node_count; I get only 11).
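
    As a quick check (a sketch assuming the pipelines from the question have already been fitted), you can compare the two models' predictions element-wise and inspect the size of the forest's single tree:

    # Fraction of validation predictions on which the two models agree
    # (1.0 means they are identical, matching the equal accuracy scores).
    pred_rf = my_pipelines[0].predict(X_valid)   # RandomForestClassifier pipeline
    pred_xgb = my_pipelines[1].predict(X_valid)  # XGBClassifier pipeline
    print((pred_rf == pred_xgb).mean())

    # Number of nodes in the forest's single decision tree (11 here).
    print(my_pipelines[0][-1].estimators_[0].tree_.node_count)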

    Add 2 additional numerical features (e.g. Fare and Age) and you will see that the trees can find additional rules, and the performances will then differ; a possible extension of the preprocessor is sketched below.
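
    For instance, the column transformer could be extended along these lines (a sketch; the median imputation for Age and Fare is my assumption, not code from the original post):

    from sklearn.compose import make_column_transformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    preprocessor = make_column_transformer(
        # Categorical features: impute the most frequent value, then one-hot encode.
        (make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()),
         ['Embarked']),
        (OneHotEncoder(), ['Sex']),
        # Numerical features: Age in particular has many missing values in Titanic.
        (SimpleImputer(strategy='median'), ['Age', 'Fare']),
    )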