I have a dataset with 1000 data points. Each data point is assigned label 1 or 0 as follows.
My dataset:
node, feature1, feature2, ........, Label
x1, 0.8, 0.9, ........, 1
x2, 0.2, 0.6, ........, 1
...
x999, 0.1, 0.1, ........, 0
x1000,0.8, 0.9, ........, 1
I want to perform a binary classification and rank my data points by their predicted probability for class 1. For that I am currently using the predict_proba function in sklearn. So my output should look as follows.
My expected output:
node prediction_probability_of_class_1
x8, 1.0
x5, 1.0
x990,0.95
x78, 0.92
x85, 0.91
x6, 0.90
and so on ........
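For context, this is roughly how I picture building that ranked table; just a minimal sketch, where clf, X and nodes are placeholder names for a fitted binary classifier, my feature matrix and my node IDs:
import pandas as pd

# placeholder names: clf is a fitted binary classifier, X my feature matrix, nodes my node IDs
ranking = pd.DataFrame({'node': nodes,
                        'prediction_probability_of_class_1': clf.predict_proba(X)[:, 1]})
# highest probability of class 1 first
print(ranking.sort_values('prediction_probability_of_class_1', ascending=False))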
I have been trying to do this for a while, using the following two approaches. However, the results I get do not match each other, so I think one of my approaches (or both) is incorrect.
Since my dataset belongs to my company and includes sensitive data, I will show my two approaches using the iris dataset, which has 150 data points.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
iris = datasets.load_iris()
X = iris.data
y = iris.target
My approach 1:
#random forest classifier
clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
#perform 10-fold cross validation
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
#get the out-of-fold predict_proba for each instance
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
#get the probability of class 1
print(proba[:,1])
#get the data point indices sorted by ascending probability of class 1
print(np.argsort(proba[:,1]))
So my results look as follows.
#probability of each data point for class 1
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.1 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.2 0. 0. 0. 0. 0.1 0. 0. 0. 0. 0. 0. 0. 0. 0.9 1. 0.7 1.
1. 1. 1. 0.7 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.9 0.9 0.1 1.
0.6 1. 1. 1. 0.9 0. 1. 1. 1. 1. 1. 0.4 0.9 0.9 1. 1. 1. 0.9
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.9 0.
0.1 0. 0. 0. 0. 0. 0. 0. 0.1 0. 0. 0.8 0. 0.1 0. 0.1 0. 0.1
0.3 0.2 0. 0.6 0. 0. 0. 0.6 0.4 0. 0. 0. 0.8 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. ]
#indices of the above data points, sorted by ascending probability of class 1
[ 0 113 112 111 110 109 107 105 104 114 103 101 100 77 148 49 48 47
46 102 115 117 118 147 146 145 144 143 142 141 140 139 137 136 135 132
131 130 128 124 122 120 45 44 149 42 15 26 16 17 18 19 20 21
22 43 23 24 35 34 33 32 31 30 29 28 27 37 13 25 9 10
7 6 5 4 3 8 11 2 1 38 39 40 12 108 116 41 121 70
14 123 125 36 127 126 134 83 72 133 129 52 57 119 138 89 76 50
84 106 85 69 68 97 98 66 65 64 63 62 61 67 60 58 56 55
54 53 51 59 71 73 75 96 95 94 93 92 91 90 88 87 86 82
81 80 79 78 99 74]
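To turn those two arrays into an actual ranking I then do something like the following; just a sketch, and the column names are my own choice:
import numpy as np
import pandas as pd

# pair every row index with its out-of-fold probability for class 1, highest first
ranking = pd.DataFrame({'index': np.arange(len(proba)),
                        'proba_class_1': proba[:, 1]})
print(ranking.sort_values('proba_class_1', ascending=False))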
My approach 2:
Since the cross_val_predict call I am using above does not give me access to the fitted estimator, I cannot read attributes such as clf.classes_. Therefore, I am using the code below.
#10-fold cross validation scores (accuracy by default)
cv_1 = cross_val_score(clf, X, y, cv=k_fold)
#fit once on the full dataset to get predict_proba and classes_
clf.fit(X, y)
probabilities = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_)
probabilities['Y'] = y
probabilities.columns.name = 'Classes'
#sort by the probability of class 1
print(probabilities.sort_values(1))
My results are as follows.
Classes 0 1 2 Y
0 1.0 0.0 0.0 0
115 0.0 0.0 1.0 2
114 0.0 0.0 1.0 2
113 0.0 0.0 1.0 2
112 0.0 0.0 1.0 2
111 0.0 0.0 1.0 2
110 0.0 0.0 1.0 2
109 0.0 0.0 1.0 2
108 0.0 0.0 1.0 2
107 0.0 0.0 1.0 2
105 0.0 0.0 1.0 2
104 0.0 0.0 1.0 2
103 0.0 0.0 1.0 2
102 0.0 0.0 1.0 2
101 0.0 0.0 1.0 2
100 0.0 0.0 1.0 2
148 0.0 0.0 1.0 2
49 1.0 0.0 0.0 0
48 1.0 0.0 0.0 0
47 1.0 0.0 0.0 0
116 0.0 0.0 1.0 2
46 1.0 0.0 0.0 0
117 0.0 0.0 1.0 2
120 0.0 0.0 1.0 2
147 0.0 0.0 1.0 2
146 0.0 0.0 1.0 2
145 0.0 0.0 1.0 2
144 0.0 0.0 1.0 2
143 0.0 0.0 1.0 2
142 0.0 0.0 1.0 2
.. ... ... ... ..
63 0.0 1.0 0.0 1
59 0.0 1.0 0.0 1
58 0.0 1.0 0.0 1
55 0.0 1.0 0.0 1
54 0.0 1.0 0.0 1
53 0.0 1.0 0.0 1
51 0.0 1.0 0.0 1
50 0.0 1.0 0.0 1
61 0.0 1.0 0.0 1
99 0.0 1.0 0.0 1
76 0.0 1.0 0.0 1
79 0.0 1.0 0.0 1
96 0.0 1.0 0.0 1
95 0.0 1.0 0.0 1
94 0.0 1.0 0.0 1
93 0.0 1.0 0.0 1
92 0.0 1.0 0.0 1
91 0.0 1.0 0.0 1
90 0.0 1.0 0.0 1
78 0.0 1.0 0.0 1
89 0.0 1.0 0.0 1
87 0.0 1.0 0.0 1
86 0.0 1.0 0.0 1
85 0.0 1.0 0.0 1
84 0.0 1.0 0.0 1
82 0.0 1.0 0.0 1
81 0.0 1.0 0.0 1
80 0.0 1.0 0.0 1
88 0.0 1.0 0.0 1
74 0.0 1.0 0.0 1
As you can see, the probability values of class 1 for each data point in the two approaches are not equivalent. Consider data point 88: it is 0 in approach 1 and 1 in approach 2.
Therefore, I would like to know the correct way to do this in Python. Note: I want to perform 10-fold cross validation to obtain my probability values.
I am happy to provide more details if needed.
I've added a small piece of code to yours. Removing the last print, you can add the following lines to see the difference between the two predictions:
probabilities['other method'] = proba[:,1]
probabilities['diff'] = probabilities[1] - probabilities['other method']
probabilities[probabilities['diff'] != 0]
and the result is the following:
Classes 0 1 2 Y other method diff
20 1.0 0.0 0.0 0 0.1 -0.1
36 1.0 0.0 0.0 0 0.1 -0.1
41 1.0 0.0 0.0 0 0.1 -0.1
50 0.0 1.0 0.0 1 0.9 0.1
52 0.0 0.9 0.1 1 1.0 -0.1
56 0.0 0.9 0.1 1 1.0 -0.1
57 0.0 0.9 0.1 1 1.0 -0.1
59 0.0 1.0 0.0 1 0.9 0.1
60 0.0 0.9 0.1 1 1.0 -0.1
68 0.0 0.9 0.1 1 1.0 -0.1
... ... ... ... ... ... ...
123 0.0 0.2 0.8 2 0.4 -0.2
127 0.0 0.2 0.8 2 0.1 0.1
129 0.0 0.1 0.9 2 0.6 -0.5
133 0.0 0.1 0.9 2 0.9 -0.8
134 0.0 0.2 0.8 2 0.6 -0.4
137 0.0 0.0 1.0 2 0.1 -0.1
138 0.0 0.3 0.7 2 0.6 -0.3
141 0.0 0.0 1.0 2 0.1 -0.1
142 0.0 0.0 1.0 2 0.1 -0.1
146 0.0 0.0 1.0 2 0.1 -0.1
and you see there is indeed a difference between the two predictions for 29 elements. So why, you ask? Well, it's because you are not training the algorithm the same way:
clf.fit(X, y)
clf.predict_proba(X)
and
cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
are not the same. With fit followed by predict_proba you train the model once on the whole dataset and then predict on that same data, while cross_val_predict uses cross validation, so each point is predicted by a model that never saw it during training.
The results are then different, but not by much in most cases. For example, if we drop all elements whose diff is only ±0.1, we are left with just 12 elements. The k-fold CV is taking care of the ambiguous points, and those must be the ones that differ. Hope it helps; tell me if you have doubts.
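To make the difference concrete, here is a rough sketch of the idea behind cross_val_predict (not its exact internal code): each fold is predicted by a clone of the model trained only on the other folds, whereas the single fit trains once on everything and predicts that same data.
import numpy as np
from sklearn.base import clone

# rough idea of cross_val_predict(..., method='predict_proba')
oof_proba = np.zeros((len(X), len(np.unique(y))))
for train_idx, test_idx in k_fold.split(X, y):
    fold_clf = clone(clf)                         # fresh, unfitted copy
    fold_clf.fit(X[train_idx], y[train_idx])      # train on the other 9 folds
    oof_proba[test_idx] = fold_clf.predict_proba(X[test_idx])  # predict the held-out fold

# versus the single fit: one model trained on all the data, predicting that same data
clf.fit(X, y)
resub_proba = clf.predict_proba(X)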
EDIT
To answer the comment: yes, the CV is the better idea. Following your update, I think the best way is to use the DataFrame you already have at the beginning and then sort it:
df = pd.DataFrame(index=['x1','x2',...,'x1000'],columns=['prediction_class_1']).fillna(0)
df['prediction_class_1'] = clf.predict(X) #clf trained and X the features values
print(df.sort_values('prediction_class_1'))
with X built from your own feature columns, for example:
X = train[['feature1','feature2', ........,'featuren']].values
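And if you want the ranking to come from the 10-fold CV probabilities rather than from a single fit on all the data (which, per your note, is what you are after), you can plug the cross_val_predict output into that same DataFrame idea. A sketch, with node IDs generated here just to match the length of X and a column name of my own choosing:
# out-of-fold probability of class 1 for every data point
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')

# node IDs; generated here, replace with your own x1 ... x1000 labels
node_ids = ['x{}'.format(i) for i in range(1, len(X) + 1)]

df = pd.DataFrame({'prediction_probability_of_class_1': proba[:, 1]}, index=node_ids)
# most likely class-1 members first
print(df.sort_values('prediction_probability_of_class_1', ascending=False))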