python machine-learning scikit-learn random-forest decision-tree

Building a Random Forest Classifier with equal output probabilities to a Decision Tree Classifier


I have been trying to build a RandomForestClassifier() (RF) model and a DecisionTreeClassifier() (DT) model that produce the same output (for learning purposes only). I have found answers to related questions that list the parameters required to make the two models equivalent, but I can't find code that actually does it, so I'm trying to build that code myself:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

random_seed = 42

X, y = make_classification(
    n_samples=100000,
    n_features=5,
    random_state=random_seed
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)

DT = DecisionTreeClassifier(criterion='gini',             # default
                            splitter='best',              # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # default
                            random_state=random_seed,     # NON-default
                            max_leaf_nodes=None,          # default
                            min_impurity_decrease=0.0,    # default
                            class_weight=None,            # default
                            ccp_alpha=0.0                 # default
                           )
DT.fit(X_train, y_train)

RF = RandomForestClassifier(n_estimators=1,               # NON-default
                            criterion='gini',             # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # NON-default
                            max_leaf_nodes=None,          # default 
                            min_impurity_decrease=0.0,    # default
                            bootstrap=False,              # NON-default
                            oob_score=False,              # default 
                            n_jobs=None,                  # default
                            random_state=random_seed,     # NON-default
                            verbose=0,                    # default
                            warm_start=False,             # default
                            class_weight=None,            # default
                            ccp_alpha=0.0,                # default
                            max_samples=None              # default
                           )

RF.fit(X_train, y_train)

RF_pred =  RF.predict(X_test)
RF_proba = RF.predict_proba(X_test)
DT_pred =  DT.predict(X_test)
DT_proba = DT.predict_proba(X_test)


# Check whether the outputs are equal and report the fraction of test rows that differ
print('If DT_pred = RF_pred:', np.array_equal(DT_pred, RF_pred), '; Fraction of rows not equal:', (DT_pred != RF_pred).mean())
print('If DT_proba = RF_proba:', np.array_equal(DT_proba, RF_proba), '; Fraction of rows not equal:', (DT_proba != RF_proba).any(axis=1).mean())

# A plot that shows where those differences are concentrated
sns.set(style="darkgrid")
mask = (RF_proba[:,1] - DT_proba[:,1]) != 0
only_differences = (RF_proba[:,1] - DT_proba[:,1])[mask]
sns.kdeplot(only_differences, fill=True, color="r")  # fill= replaces the deprecated shade=
plt.title('Plot of only differences in probs scores')
plt.show()

Output:

[Image: KDE plot titled 'Plot of only differences in probs scores']

I even found an answer that compares XGBoost with a decision tree and says they are almost identical, but when I test their probability outputs they are fairly different.

So, am I doing something wrong here? How can I get the same probabilities from the two models? Is it possible to get True for both print() statements in the code above?


Solution

  • It appears to be due to random states, despite your best efforts. To randomize effectively, the random forest has to give each component decision tree a different random state (it does this with sklearn.ensemble._base._set_random_states). You can check in your code that while RF.random_state and DT.random_state are both 42, RF.estimators_[0].random_state is 1608637542.

    When bootstrap=False and max_features=None, I believe this only affects how tied-gain splits are resolved, so the results are very close on the training set. That can translate into slightly larger differences on a test set.
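
One way to test this explanation is to read the seed the forest actually gave its single tree and reuse it for the standalone tree. The following is a minimal sketch rather than a definitive recipe: the smaller n_samples and the name inner_seed are my own choices for illustration, and it assumes that matching this seed (with all other parameters left equal) is enough to make both trees resolve tied splits the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random_seed = 42

# Smaller dataset than in the question, only to keep the sketch fast.
X, y = make_classification(n_samples=2000, n_features=5, random_state=random_seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)

# A one-tree forest that sees all rows and all features.
RF = RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False,
                            random_state=random_seed)
RF.fit(X_train, y_train)

# The forest hands its single tree a derived seed, not random_seed itself.
inner_seed = RF.estimators_[0].random_state
print('Forest seed:', RF.random_state, '; inner tree seed:', inner_seed)

# Assumption: reusing the derived seed makes the standalone tree
# break tied-gain splits exactly as the forest's tree did.
DT = DecisionTreeClassifier(max_features=None, random_state=inner_seed)
DT.fit(X_train, y_train)

same_pred = np.array_equal(DT.predict(X_test), RF.predict(X_test))
same_proba = np.allclose(DT.predict_proba(X_test), RF.predict_proba(X_test))
print('Same predictions:', same_pred, '; same probabilities:', same_proba)
```

If both flags print True on your scikit-learn version, the remaining differences in the question's code come from the seed mismatch alone. If they do not, something other than the random state is involved and the explanation above is incomplete.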