Tags: python, scikit-learn, cross-validation, k-fold

StratifiedKFold split train and validation set size


I am using StratifiedKFold and I am not sure what the training and test set sizes returned by kfold.split are in my code below. Assuming print(array.shape) returns (12904, 47), i.e. 12904 rows and 47 columns, what would the training and test sizes be?

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, Y):
    # Fit the model on the training fold
    model.fit(X[train], Y[train])
    # Predict class labels for the training fold
    predicted = model.predict(X[train])

    predicted_report = classification_report(Y[train], predicted)
    print(predicted_report)
    # accuracy: (tp + tn) / (p + n)
    accuracy = accuracy_score(Y[train], predicted)

Solution

  • As already hinted in the comments, your training set size will be (n_splits-1)/n_splits and your validation set size will be 1/n_splits of the size of your initial data, i.e. here 4/5 and 1/5, respectively.

    Here is a simple reproducible demonstration using the iris data and n_splits=5, as in your case:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.datasets import load_iris
    
    iris = load_iris()
    X = iris.data
    y = iris.target
    print(X.shape) # initial dataset size
    # (150, 4)
    
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
    
    for train, validation in kfold.split(X, y):
        print(X[train].shape, X[validation].shape)
    

    The result of which is:

    (120, 4) (30, 4)
    (120, 4) (30, 4)
    (120, 4) (30, 4)
    (120, 4) (30, 4)
    (120, 4) (30, 4)
    

    So, to check for yourself with your own data, just add the above print statement inside your for loop.
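
    For your specific case of 12904 rows, note that 12904 is not divisible by 5, so the fold sizes will differ slightly (roughly 2580-2582 validation samples and 10322-10324 training samples per split, depending on the class distribution). Here is a sketch using synthetic data of the same shape as a stand-in for your dataset (the labels are hypothetical, generated with make_classification):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    # Synthetic stand-in for the (12904, 47) dataset; labels are hypothetical
    X, y = make_classification(n_samples=12904, n_features=47, random_state=8)

    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

    for train, validation in kfold.split(X, y):
        # Each split partitions the full dataset into train + validation
        print(len(train), len(validation), len(train) + len(validation))
    ```

    Every line printed should sum to 12904, with the validation size close to 12904/5 ≈ 2581; the exact per-fold counts can vary by a sample or two because stratification balances each class separately.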