In my work, I have a feature set consisting entirely of Boolean values, and each feature vector belongs to a class. The classes are strings:
feature set              class (string)
[True False True ...]    "A"
[True True True ...]     "B"
[True True False ...]    "C"
When I train these data with the Random Forest algorithm:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# encode the string classes as integers; factor[1] keeps the original labels
factor = pd.factorize(classes)
classes = factor[0]
classifier = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
classifier.fit(x_train, classes)
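Since pd.factorize also returns the original string labels in factor[1], integer predictions can be mapped back to the class names (a quick sketch; x_test here is a hypothetical held-out feature set):

pred = classifier.predict(x_test)  # integer codes, as produced by factorize
pred_labels = factor[1][pred]      # back to the strings, e.g. ['A', 'B', ...]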
The classifier predicts the correct class 97% of the time. When I call
classifier.predict_proba(sample1_feature_set)
it gives the relative probabilities of each class for sample1, for example:
[0.80  0.05  0.15]
   ↓     ↓     ↓
  "A"   "B"   "C"    (probability of each class for sample1)
so when I add up the values in the list (0.80 + 0.05 + 0.15), the result is always 1. This shows that the evaluation is relative: the probability of one class affects the probabilities of the others.
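A quick check with numpy confirms this for every sample (a sketch, reusing classifier and x_train from above):

import numpy as np

probs = classifier.predict_proba(x_train)
print(np.allclose(probs.sum(axis=1), 1.0))  # True: every row sums to 1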
I want to get the independent probabilities of all classes for sample1, like
[0.95  0.69  0.87]
   ↓     ↓     ↓
  "A"   "B"   "C"    (independent probability of each class for sample1)
That is, sample1 is 95% "A", 69% "B", and 87% "C". Do you have any idea how I can do this?
Random forest is an ensemble method. Basically, it builds individual decision trees on different bootstrap subsets of the data (bagging) and averages the predictions across all trees to give you the probabilities. The scikit-learn help page is actually a good place to start:
In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.
Examples: Bagging methods, Forests of randomized trees, …
Hence the probabilities will always sum to one. Below is an example of how to access the individual predictions of each tree:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_estimators=10)
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)  # averaged class probabilities
pred[:5, :]
array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1]])
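Under the hood, scikit-learn computes these as the mean of each tree's predict_proba (soft voting) rather than by counting hard votes; since the trees here are grown until the leaves are pure, the two coincide. A quick check (a sketch, reusing model, X_test, and pred from above):

per_tree = np.stack([t.predict_proba(X_test) for t in model.estimators_])
print(np.allclose(per_tree.mean(axis=0), pred))  # True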
This is the prediction for the first tree:
model.estimators_[0].predict(X_test)
array([1., 0., 2., 2., 1., 0., 1., 2., 2., 1., 2., 0., 0., 0., 0., 2., 2.,
       1., 1., 2., 0., 2., 0., 2., 2., 2., 2., 2., 0., 0., 0., 0., 1., 0.,
       0., 2., 1., 0., 0., 0., 2., 2., 1., 0., 0., 1., 1., 2., 1., 2.])
We tally across all trees:
# count each tree's hard vote per sample
result = np.zeros((len(X_test), 3))
for i in range(len(model.estimators_)):
    p = model.estimators_[i].predict(X_test).astype(int)
    result[range(len(X_test)), p] += 1
result[:5,:]
array([[ 0., 10.,  0.],
       [10.,  0.,  0.],
       [ 0.,  0., 10.],
       [ 0.,  9.,  1.],
       [ 0.,  9.,  1.]])
Dividing this by the number of trees gives the probabilities you obtained before:
result/10

array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1]])
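To get independent per-class probabilities, as asked in the question, one option (a sketch, not the only approach) is to fit a separate binary forest per class (one-vs-rest) and read off each model's positive-class probability:

# one binary "class c vs. rest" forest per class; each column is an
# independent probability, so the rows need not sum to one
scores = np.column_stack([
    RandomForestClassifier(n_estimators=10, random_state=0)
    .fit(X_train, y_train == c)
    .predict_proba(X_test)[:, 1]   # P(sample belongs to class c)
    for c in np.unique(y_train)
])
scores[:5, :]

Note that scikit-learn's own OneVsRestClassifier renormalizes its predict_proba rows to sum to one in the single-label case, which is why the binary models are fitted by hand here.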