I am using StratifiedKFold, and I am not sure what training and test sizes are returned by kfold.split
in my code below. Assuming print(array.shape)
returns (12904, 47),
i.e. 12904 rows and 47 columns, what would the training and test sizes be?
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, Y):
    # Fit the model
    model.fit(X[train], Y[train])
    # Predict class labels for the training set
    predicted = model.predict(X[train])
    predicted_report = classification_report(Y[train], predicted)
    print(predicted_report)
    # accuracy: (tp + tn) / (p + n)
    accuracy = accuracy_score(Y[train], predicted)
As already hinted in the comments, your training set size will be (n_splits-1)/n_splits
and your validation set size will be 1/n_splits
of the size of your initial data, i.e. here 4/5 and 1/5, respectively.
Here is a simple reproducible demonstration using the iris data and n_splits=5, as in your case:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
The result of which is:
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
So, to check with your own data, you just need to add the above print
statement inside your for-loop.
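For your exact shape, note that 12904 is not divisible by 5, so fold sizes will differ by a row or two rather than being exactly 10323.2/2580.8. A quick sketch with synthetic data of the same shape (the binary labels here are an assumption; with more classes in your actual Y the per-fold counts can shift by another row):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(8)
X = rng.rand(12904, 47)          # same shape as your data
Y = rng.randint(0, 2, 12904)     # hypothetical binary labels

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, Y):
    # each validation fold holds roughly 12904 / 5 ≈ 2581 rows,
    # and each training fold the remaining ~10323
    print(X[train].shape, X[validation].shape)
```

Every train/validation pair always partitions the full 12904 rows, so the two sizes sum to the original row count on every iteration.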