python machine-learning scikit-learn random-forest decision-tree

Building a Random Forest Classifier with equal output probabilities to a Decision Tree Classifier

I have been trying to build a RandomForestClassifier() (RF) model and a DecisionTreeClassifier() (DT) model that produce the same output (for learning purposes only). I found several questions whose answers list the parameters required to make the two models equivalent, but none with code that actually does it, so I am trying to write that code myself:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

random_seed = 42

X, y = make_classification(
    n_samples=100000,
    n_features=5,
    random_state=random_seed
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)

DT = DecisionTreeClassifier(criterion='gini',             # default
                            splitter='best',              # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # default
                            random_state=random_seed,     # NON-default
                            max_leaf_nodes=None,          # default
                            min_impurity_decrease=0.0,    # default
                            class_weight=None,            # default
                            ccp_alpha=0.0                 # default
                           )
DT.fit(X_train, y_train)

RF = RandomForestClassifier(n_estimators=1,               # NON-default
                            criterion='gini',             # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # NON-default
                            max_leaf_nodes=None,          # default 
                            min_impurity_decrease=0.0,    # default
                            bootstrap=False,              # NON-default
                            oob_score=False,              # default 
                            n_jobs=None,                  # default
                            random_state=random_seed,     # NON-default
                            verbose=0,                    # default
                            warm_start=False,             # default
                            class_weight=None,            # default
                            ccp_alpha=0.0,                # default
                            max_samples=None              # default
                           )

RF.fit(X_train, y_train)

RF_pred =  RF.predict(X_test)
RF_proba = RF.predict_proba(X_test)
DT_pred =  DT.predict(X_test)
DT_proba = DT.predict_proba(X_test)


# Check whether the outputs are actually equal, and report the fraction of rows that are NOT equal
print('If DT_pred = RF_pred:', np.array_equal(DT_pred, RF_pred), '; Fraction of not equal:', (DT_pred != RF_pred).mean())
# A row counts as different if any of its class probabilities differ
print('If DT_proba = RF_proba:', np.array_equal(DT_proba, RF_proba), '; Fraction of not equal:', (DT_proba != RF_proba).any(axis=1).mean())

# A plot that shows where those differences are concentrated
sns.set(style="darkgrid")
mask = (RF_proba[:,1] - DT_proba[:,1]) != 0
only_differences = (RF_proba[:,1] - DT_proba[:,1])[mask]
sns.kdeplot(only_differences, fill=True, color="r")  # `shade` was replaced by `fill` in newer seaborn
plt.title('Plot of only differences in probs scores')
plt.show()

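Output: (image not preserved; it showed the two printed comparison lines and the KDE plot of the differences)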

I even found an answer that compares XGBoost with a DecisionTree and says they are almost identical, but when I test their probability outputs they are fairly different.

So, am I doing something wrong here? How can I get the same probabilities from these two models? Is it possible to get True for both print() statements in the code above?

Solution

It appears to be due to random states, despite your best efforts. For the random forest to randomize effectively, it has to give each component decision tree a different random state (using sklearn.ensemble._base._set_random_states). You can check in your code that while RF.random_state and DT.random_state are both 42, RF.estimators_[0].random_state is 1608637542.
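
To see where the tree's seed comes from, here is a minimal sketch on a small synthetic dataset. It assumes the forest derives each tree's seed by drawing one integer below the int32 maximum from a RandomState built from the forest's own seed, which is what the private helper appears to do. Because that helper is an implementation detail, this may change between scikit-learn versions. The names tree_seed and derived_seed are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

random_seed = 42

# Small dataset for illustration only (much smaller than in the question)
X, y = make_classification(n_samples=1000, n_features=5, random_state=random_seed)

RF = RandomForestClassifier(n_estimators=1, max_features=None,
                            bootstrap=False, random_state=random_seed)
RF.fit(X, y)

# Seed actually stored on the single tree inside the forest
tree_seed = RF.estimators_[0].random_state

# Assumption: the forest draws the tree's seed this way (private implementation detail)
derived_seed = np.random.RandomState(random_seed).randint(np.iinfo(np.int32).max)

print('Forest seed: ', RF.random_state)
print('Tree seed:   ', tree_seed)
print('Derived seed:', derived_seed)
```

If the assumption holds, the last two printed values agree with each other and differ from the forest's seed.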

When bootstrap=False and max_features=None, I believe the random state only affects how splits with tied gain are resolved, so the results are very close on the training set. That can translate to slightly larger differences on a test set.
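
One way to check this explanation is to give a standalone tree the seed that the forest assigned to its only tree, and then compare the outputs. This is a sketch under the assumption that, with bootstrap=False and max_features=None, the seed is the only remaining difference between the two fits. DT_matched is a name introduced here for illustration, and the dataset is smaller than in the question to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

random_seed = 42

X, y = make_classification(n_samples=5000, n_features=5, random_state=random_seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)

# Single-tree forest with no bootstrapping and all features considered at each split
RF = RandomForestClassifier(n_estimators=1, max_features=None,
                            bootstrap=False, random_state=random_seed)
RF.fit(X_train, y_train)

# Reuse the seed the forest handed to its tree, instead of the forest's own seed
DT_matched = DecisionTreeClassifier(random_state=RF.estimators_[0].random_state)
DT_matched.fit(X_train, y_train)

RF_proba = RF.predict_proba(X_test)
DT_proba = DT_matched.predict_proba(X_test)

print('If DT_proba = RF_proba:', np.array_equal(DT_proba, RF_proba))
print('Rows that differ:', int((DT_proba != RF_proba).any(axis=1).sum()))
```

Reading the seed from RF.estimators_[0] rather than hard-coding it keeps the comparison valid for any forest seed.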