I want to use StackingClassifier and VotingClassifier with StratifiedKFold and cross_val_score. I am getting nan values from cross_val_score if I use StackingClassifier or VotingClassifier. If I use any other algorithm instead of StackingClassifier or VotingClassifier, cross_val_score works fine. I am using Python 3.8.5 and sklearn 0.23.2.
Updating the code to a working example. Please use the Parkinsons dataset from Kaggle (Parkinsons Dataset). This is the dataset I have been working on, and below are the exact steps I have followed.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import feature_selection
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv('parkinsons.csv')
FS_X=dataset.iloc[:,:-1]
FS_y=dataset.iloc[:,-1:]
FS_X.drop(['name'],axis=1,inplace=True)
select_k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif,k=15)
X_k_best = select_k_best.fit_transform(FS_X,FS_y)
supportList = select_k_best.get_support().tolist()
p_valuesList = select_k_best.pvalues_.tolist()
toDrop=[]
for i in np.arange(len(FS_X.columns)):
    bool = supportList[i]
    if(bool == False):
        toDrop.append(FS_X.columns[i])
FS_X.drop(toDrop,axis=1,inplace=True)
smote = SMOTE(random_state=7)
Balanced_X,Balanced_y = smote.fit_sample(FS_X,FS_y)
before = pd.merge(FS_X,FS_y,right_index=True, left_index=True)
after = pd.merge(Balanced_X,Balanced_y,right_index=True, left_index=True)
b=before['status'].value_counts()
a=after['status'].value_counts()
print('Before')
print(b)
print('After')
print(a)
SkFold = model_selection.StratifiedKFold(n_splits=10, random_state=7, shuffle=False)
estimators_list = list()
KNN = KNeighborsClassifier()
RF = RandomForestClassifier(criterion='entropy',random_state = 1)
DT = DecisionTreeClassifier(criterion='entropy',random_state = 1)
GNB = GaussianNB()
LR = LogisticRegression(random_state = 1)
estimators_list.append(LR)
estimators_list.append(RF)
estimators_list.append(DT)
estimators_list.append(GNB)
SCLF = StackingClassifier(estimators = estimators_list,final_estimator = KNN,stack_method = 'predict_proba',cv=SkFold,n_jobs = -1)
VCLF = VotingClassifier(estimators = estimators_list,voting = 'soft',n_jobs = -1)
scores1 = model_selection.cross_val_score(estimator = SCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('StackingClassifier Scores',scores1)
scores2 = model_selection.cross_val_score(estimator = VCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('VotingClassifier Scores',scores2)
scores3 = model_selection.cross_val_score(estimator = DT,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('DecisionTreeClassifier Scores',scores3)
Output
Before
1 147
0 48
Name: status, dtype: int64
After
1 147
0 147
Name: status, dtype: int64
StackingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
VotingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
DecisionTreeClassifier Scores [0.86666667 0.9 0.93333333 0.86666667 0.96551724 0.82758621
0.75862069 0.86206897 0.86206897 0.93103448]
I checked some other related posts on Stack Overflow but couldn't solve my issue. I am not able to understand where I am going wrong.
The estimators_list, as it is passed to StackingClassifier or VotingClassifier, is incorrect. The scikit-learn documentation for StackingClassifier says:
Base estimators which will be stacked together. Each element of the list is defined as a tuple of string (i.e. name) and an estimator instance. An estimator can be set to ‘drop’ using set_params.
So a correct list would look like the following:
KNN = KNeighborsClassifier()
DT = DecisionTreeClassifier(criterion="entropy")
GNB = GaussianNB()
estimators_list = [("KNN", KNN), ("DT", DT), ("GNB", GNB)]
And a complete minimal working example with your Parkinsons data could look like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
dataset = pd.read_csv("parkinsons.csv")
FS_X = dataset.drop(["name", "status"], axis=1)
FS_y = dataset["status"]
estimators_list = [("KNN", KNeighborsClassifier()), ("DT", DecisionTreeClassifier(criterion="entropy")), ("GNB", GaussianNB())]
SCLF = StackingClassifier(estimators=estimators_list)
X_train, X_test, y_train, y_test = train_test_split(FS_X, FS_y)
SCLF.fit(X_train, y_train)
print("SCLF: ", accuracy_score(y_test, SCLF.predict(X_test)))