
Why are the results different when getting the top predictions in sklearn in Python?


I have a dataset with 1000 data points. Each data point is assigned label 1 or 0 as follows.

My dataset:

node, feature1, feature2, ........, Label
x1,   0.8, 0.9, ........, 1
x2,   0.2, 0.6, ........, 1
...
x999, 0.1, 0.1, ........, 0
x1000,0.8, 0.9, ........, 1

I want to perform a binary classification and rank my data points by their predicted probability for class 1. For that I am currently using the predict_proba function in sklearn, so my output should look as follows (a sketch of how such a ranking can be built is shown after the example).

My expected output:

node prediction_probability_of_class_1
x8,  1.0
x5,  1.0
x990,0.95
x78, 0.92
x85, 0.91
x6,  0.90
and so on ........
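For reference, here is a minimal sketch of how such a ranking could be assembled from class-1 probabilities; the node names and probability values below are placeholders, not real data:

import numpy as np
import pandas as pd

nodes = np.array(['x1', 'x2', 'x3'])        # hypothetical node names
proba_class_1 = np.array([0.2, 0.9, 0.5])   # hypothetical class-1 probabilities

ranking = pd.DataFrame({'node': nodes,
                        'prediction_probability_of_class_1': proba_class_1})
# highest probability first
print(ranking.sort_values('prediction_probability_of_class_1', ascending=False))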

I have been trying to do this for a while using the following two approaches. However, the results I get do not match each other, so I think one (or both) of my approaches is incorrect.

Since my dataset belongs to my company and includes sensitive data, I will demonstrate my two approaches using the iris dataset, which has 150 data points.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

My approach 1:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
import numpy as np

#random forest classifier
clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
#perform 10-fold cross-validation
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
#get predict_proba for each instance
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
#get the probability of class 1
print(proba[:, 1])
#indices of the data points, sorted by ascending probability of class 1
print(np.argsort(proba[:, 1]))

My results look as follows.

#probability of each data point for class 1
[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.1 0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.2 0.  0.  0.  0.  0.1 0.  0.  0.  0.  0.  0.  0.  0.  0.9 1.  0.7 1.
 1.  1.  1.  0.7 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.9 0.9 0.1 1.
 0.6 1.  1.  1.  0.9 0.  1.  1.  1.  1.  1.  0.4 0.9 0.9 1.  1.  1.  0.9
 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.9 0.
 0.1 0.  0.  0.  0.  0.  0.  0.  0.1 0.  0.  0.8 0.  0.1 0.  0.1 0.  0.1
 0.3 0.2 0.  0.6 0.  0.  0.  0.6 0.4 0.  0.  0.  0.8 0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0. ]

#corresponding indices of the data points, sorted by ascending probability
[  0 113 112 111 110 109 107 105 104 114 103 101 100  77 148  49  48  47
  46 102 115 117 118 147 146 145 144 143 142 141 140 139 137 136 135 132
 131 130 128 124 122 120  45  44 149  42  15  26  16  17  18  19  20  21
  22  43  23  24  35  34  33  32  31  30  29  28  27  37  13  25   9  10
   7   6   5   4   3   8  11   2   1  38  39  40  12 108 116  41 121  70
  14 123 125  36 127 126 134  83  72 133 129  52  57 119 138  89  76  50
  84 106  85  69  68  97  98  66  65  64  63  62  61  67  60  58  56  55
  54  53  51  59  71  73  75  96  95  94  93  92  91  90  88  87  86  82
  81  80  79  78  99  74]

My approach 2:

Since cross_val_predict, which I am using above, does not expose a fitted classifier, I cannot access attributes such as clf.classes_. Therefore, I am using the code below (see the aside after it for a way to recover the class order without fitting).

import pandas as pd
from sklearn.model_selection import cross_val_score

cv_1 = cross_val_score(clf, X, y, cv=k_fold)
clf.fit(X, y)
probabilities = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_)
probabilities['Y'] = y
probabilities.columns.name = 'Classes'
print(probabilities.sort_values(1))
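
As an aside, the class order does not actually require fitting. A minimal sketch, assuming the standard scikit-learn behaviour that predict_proba columns follow the sorted unique labels:

import numpy as np

#predict_proba columns are ordered by the sorted unique labels,
#so the class order can be recovered without fitting the classifier
classes = np.unique(y)  # for iris: array([0, 1, 2])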

My results are as follows.

Classes    0    1    2  Y
0        1.0  0.0  0.0  0
115      0.0  0.0  1.0  2
114      0.0  0.0  1.0  2
113      0.0  0.0  1.0  2
112      0.0  0.0  1.0  2
111      0.0  0.0  1.0  2
110      0.0  0.0  1.0  2
109      0.0  0.0  1.0  2
108      0.0  0.0  1.0  2
107      0.0  0.0  1.0  2
105      0.0  0.0  1.0  2
104      0.0  0.0  1.0  2
103      0.0  0.0  1.0  2
102      0.0  0.0  1.0  2
101      0.0  0.0  1.0  2
100      0.0  0.0  1.0  2
148      0.0  0.0  1.0  2
49       1.0  0.0  0.0  0
48       1.0  0.0  0.0  0
47       1.0  0.0  0.0  0
116      0.0  0.0  1.0  2
46       1.0  0.0  0.0  0
117      0.0  0.0  1.0  2
120      0.0  0.0  1.0  2
147      0.0  0.0  1.0  2
146      0.0  0.0  1.0  2
145      0.0  0.0  1.0  2
144      0.0  0.0  1.0  2
143      0.0  0.0  1.0  2
142      0.0  0.0  1.0  2
..       ...  ...  ... ..
63       0.0  1.0  0.0  1
59       0.0  1.0  0.0  1
58       0.0  1.0  0.0  1
55       0.0  1.0  0.0  1
54       0.0  1.0  0.0  1
53       0.0  1.0  0.0  1
51       0.0  1.0  0.0  1
50       0.0  1.0  0.0  1
61       0.0  1.0  0.0  1
99       0.0  1.0  0.0  1
76       0.0  1.0  0.0  1
79       0.0  1.0  0.0  1
96       0.0  1.0  0.0  1
95       0.0  1.0  0.0  1
94       0.0  1.0  0.0  1
93       0.0  1.0  0.0  1
92       0.0  1.0  0.0  1
91       0.0  1.0  0.0  1
90       0.0  1.0  0.0  1
78       0.0  1.0  0.0  1
89       0.0  1.0  0.0  1
87       0.0  1.0  0.0  1
86       0.0  1.0  0.0  1
85       0.0  1.0  0.0  1
84       0.0  1.0  0.0  1
82       0.0  1.0  0.0  1
81       0.0  1.0  0.0  1
80       0.0  1.0  0.0  1
88       0.0  1.0  0.0  1
74       0.0  1.0  0.0  1

As you can see, the class-1 probabilities of the data points in the two approaches are not equivalent. Consider data point 88: its probability is 0 in approach 1 and 1 in approach 2.

Therefore, I would like to know the correct way to do this in Python. Note: I want to perform 10-fold cross-validation to obtain my probability values.

I am happy to provide more details if needed.


Solution

  • I've added a small piece of code to yours. Replacing the last print, you can add the following code to see the difference between the two predictions:

    probabilities['other method'] = proba[:, 1]
    probabilities['diff'] = probabilities[1] - probabilities['other method']
    print(probabilities[probabilities['diff'] != 0])
    

    and the result is the following:

    Classes 0    1        2     Y   other method diff
    20   1.0    0.0     0.0     0   0.1         -0.1
    36   1.0    0.0     0.0     0   0.1         -0.1
    41   1.0    0.0     0.0     0   0.1         -0.1
    50   0.0    1.0     0.0     1   0.9         0.1
    52   0.0    0.9     0.1     1   1.0         -0.1
    56   0.0    0.9     0.1     1   1.0         -0.1
    57   0.0    0.9     0.1     1   1.0         -0.1
    59   0.0    1.0     0.0     1   0.9         0.1
    60   0.0    0.9     0.1     1   1.0         -0.1
    68   0.0    0.9     0.1     1   1.0         -0.1
    ... ... ... ... ... ... ...
    123  0.0    0.2     0.8     2   0.4         -0.2
    127  0.0    0.2     0.8     2   0.1         0.1
    129  0.0    0.1     0.9     2   0.6         -0.5
    133  0.0    0.1     0.9     2   0.9         -0.8
    134  0.0    0.2     0.8     2   0.6         -0.4
    137  0.0    0.0     1.0     2   0.1         -0.1
    138  0.0    0.3     0.7     2   0.6         -0.3
    141  0.0    0.0     1.0     2   0.1         -0.1
    142  0.0    0.0     1.0     2   0.1         -0.1
    146  0.0    0.0     1.0     2   0.1         -0.1
    

    and you can see that there is indeed a difference between the two for 29 elements. Why? Because you are not training the algorithm the same way:

    clf.fit(X, y)
    clf.predict_proba(X)
    

    and

    cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
    

    are not the same. With clf.fit(X, y) you train the classifier once on the full dataset and then predict on that same data (in-sample predictions), whereas cross_val_predict scores each point with a model trained on the other folds (out-of-fold predictions), which is what gives you robustness.
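
    In other words, cross_val_predict never lets a model score a point it was trained on. A simplified sketch of the idea (not the actual scikit-learn implementation):

    import numpy as np
    from sklearn.base import clone

    #out-of-fold probabilities: each fold is predicted by a model
    #fitted on the remaining nine folds
    oof = np.zeros((len(X), len(np.unique(y))))
    for train_idx, test_idx in k_fold.split(X, y):
        fold_clf = clone(clf)
        fold_clf.fit(X[train_idx], y[train_idx])
        oof[test_idx] = fold_clf.predict_proba(X[test_idx])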

    The results are therefore different, but not by far in most cases. For example, if we remove all elements with an absolute diff of 0.1 or less, only 12 elements remain. The k-fold CV is helping to take care of ambiguous points, and it must be those. Hope it helps; tell me if you have doubts.
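
    For reference, that filtering can be reproduced with something like the following, reusing the probabilities DataFrame from above:

    #keep only the points where the two approaches disagree by more than 0.1
    big_diff = probabilities[probabilities['diff'].abs() > 0.1]
    print(len(big_diff))  # 12 in this run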

    EDIT

    To answer the comment: yes, the CV is a better idea. Following your update, I think the best way is to use the DataFrame you already have at the beginning and then sort it:

    X = train[['feature1','feature2', ........,'featuren']].values  # feature matrix

    df = pd.DataFrame(index=['x1','x2',...,'x1000'], columns=['prediction_class_1'])
    #use the cross-validated class-1 probabilities (proba from cross_val_predict)
    #rather than clf.predict, since you want probabilities obtained through 10-fold CV
    df['prediction_class_1'] = proba[:, 1]
    #highest class-1 probability first
    print(df.sort_values('prediction_class_1', ascending=False))
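
    Sorting in descending order puts the highest class-1 probabilities first, which matches the expected output in the question.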