
scikit-learn StratifiedKFold implementation


I'm having a hard time understanding scikit-learn's StratifiedKFold from https://scikit-learn.org/stable/modules/cross_validation.html#stratification

I implemented the example from that page, adding a RandomOverSampler step:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

# oversample the minority class up to the size of the majority class
ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_sample was renamed to fit_resample

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X_ros, y_ros):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test  {y_ros[test]}")

Output:

train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]

My questions:

  1. Where do we define the train/test split (the 80%/20% thing) in StratifiedKFold? I can see from the StratifiedKFold docs that n_splits defines the number of folds, but not the split itself, I think. This part confuses me.

  2. Why am I getting a y_ros_test with 9 0's and 9 1's when I have n_splits=5? By my calculation it should be 50/5 = 10 samples per fold, so shouldn't it be 5 1's and 5 0's in each split?


Solution

  • Regarding your first question: there is no separate train-test split when using cross-validation (CV); instead, in each CV round one fold is used as the test set and the rest as the training set. As a result, when n_splits=5, as here, in each round 1/5 (i.e. 20%) of the data is used as the test set and the remaining 4/5 (i.e. 80%) for training. So yes, setting the n_splits argument uniquely defines the split, and no further specification is needed (with n_splits=4 you would get a 75-25 split). See the quick check below.
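
    As a quick check, a minimal sketch that prints the fold sizes and confirms each test fold holds len(X)/n_splits samples:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
    skf = StratifiedKFold(n_splits=5, shuffle=True)

    for train, test in skf.split(X, y):
        # each test fold holds 50/5 = 10 samples (20%); train holds 40 (80%)
        print(f"train size: {len(train)}, test size: {len(test)}")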

    Regarding your second question, you seem to forget that, prior to splitting, you have oversampled your data. Running your code with the initial X and y (i.e. without oversampling) does indeed give a y_test of size 50/5 = 10; this y_test is not balanced (balance is the result of oversampling) but stratified (each fold retains the class proportions of the original data):

    skf = StratifiedKFold(n_splits=5, shuffle=True)

    for train, test in skf.split(X, y):
        print('train -  {}   |   test -  {}'.format(
            np.bincount(y[train]), np.bincount(y[test])))
        print(f"y_test  {y[test]}")
    

    Result:

    train -  [36  4]   |   test -  [9 1]
    y_test  [0 0 0 0 0 0 0 0 0 1]
    train -  [36  4]   |   test -  [9 1]
    y_test  [0 0 0 0 0 0 0 0 0 1]
    train -  [36  4]   |   test -  [9 1]
    y_test  [0 0 0 0 0 0 0 0 0 1]
    train -  [36  4]   |   test -  [9 1]
    y_test  [0 0 0 0 0 0 0 0 0 1]
    train -  [36  4]   |   test -  [9 1]
    y_test  [0 0 0 0 0 0 0 0 0 1]
    

    Since oversampling the minority class actually increases the size of the dataset (here from 50 to 45 + 45 = 90 samples), it is only expected that you get a y_ros_test that is larger than y_test (here 90/5 = 18 samples instead of 10).
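
    You can verify these sizes directly; a minimal sketch:

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
    ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
    X_ros, y_ros = ros.fit_resample(X, y)

    print(len(y_ros))           # 90 samples after oversampling
    print(np.bincount(y_ros))   # [45 45] - both classes now equal
    print(len(y_ros) // 5)      # 18 test samples per fold with n_splits=5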

    Methodologically speaking, you don't actually need stratified splitting if you have already oversampled your data to balance the class representation, as the sketch below illustrates.
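
    To illustrate the point, a sketch (reusing X_ros and y_ros from above) showing that a plain, unstratified KFold on the shuffled, already-balanced data also yields roughly balanced folds:

    import numpy as np
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train, test in kf.split(X_ros):
        # without stratification the folds are only approximately balanced,
        # but the classes are already equal overall in the oversampled data
        print('train -  {}   |   test -  {}'.format(
            np.bincount(y_ros[train]), np.bincount(y_ros[test])))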