I have a hard time understanding scikit-learn's StratifiedKFold. I went through
https://scikit-learn.org/stable/modules/cross_validation.html#stratification
and implemented the example from there, adding a RandomOverSampler step:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_resample (fit_sample is the deprecated old name)

skf = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in skf.split(X_ros, y_ros):
    print('train - {} | test - {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test {y_ros[test]}")
Output:
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
My questions:

1. Where do we define the train/test split (the 80%/20% thing) in StratifiedKFold? I can see that n_splits defines the number of folds, but not the split itself, I think. This part confuses me.

2. Why am I getting a y_ros_test with nine 0's and nine 1's when I have n_splits=5? By my calculation it should be 50/5 = 10 samples per fold, i.e. five 0's and five 1's in each split, shouldn't it?
Regarding your first question: there is no separate train-test split when using cross-validation (CV); what happens is that, in each CV round, one fold is used as the test set and the remaining folds as the training set. As a result, with n_splits=5, as here, in each round 1/5 (i.e. 20%) of the data is used as the test set while the remaining 4/5 (i.e. 80%) is used for training. So yes, the n_splits argument uniquely determines the split, and no further specification is needed (with n_splits=4 you would get a 75/25 split).
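To see this concretely, here is a minimal sketch (reusing the X_ros and y_ros from your code) showing that changing n_splits alone changes the split ratio:

# With n_splits=4, each test fold holds roughly 1/4 (25%) of the 90
# oversampled samples, so train/test sizes come out to roughly 75%/25%
skf4 = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train, test in skf4.split(X_ros, y_ros):
    print(f"train size: {len(train)}, test size: {len(test)}")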
Regarding your second question: you seem to forget that, prior to splitting, you have oversampled your data. Running your code with the initial X and y (i.e. without oversampling) indeed gives a y_test of size 50/5 = 10; note that this is not balanced (balancing is the result of oversampling) but stratified (each fold retains the class ratio of the original data):
skf = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in skf.split(X, y):
    print('train - {} | test - {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
    print(f"y_test {y[test]}")
Result:
train - [36 4] | test - [9 1]
y_test [0 0 0 0 0 0 0 0 0 1]
train - [36 4] | test - [9 1]
y_test [0 0 0 0 0 0 0 0 0 1]
train - [36 4] | test - [9 1]
y_test [0 0 0 0 0 0 0 0 0 1]
train - [36 4] | test - [9 1]
y_test [0 0 0 0 0 0 0 0 0 1]
train - [36 4] | test - [9 1]
y_test [0 0 0 0 0 0 0 0 0 1]
Since oversampling the minority class increases the size of the dataset, it is only expected that you get a y_ros_test that is larger than y_test (here 18 samples instead of 10).
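To make the arithmetic explicit, here is a quick check (using the y_ros produced by your code):

# The oversampled dataset has 45 + 45 = 90 samples, so each of the 5
# stratified test folds holds 90 / 5 = 18 samples, i.e. 9 per class
print(np.bincount(y_ros))   # [45 45]
print(len(y_ros))           # 90
print(len(y_ros) // 5)      # 18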
Methodologically speaking, you don't actually need stratified sampling if you have already oversampled your data to balance the class representation.
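To illustrate (a sketch, assuming you keep the balanced X_ros and y_ros from above): with shuffling, even a plain, non-stratified KFold yields approximately balanced folds here, since the classes are already 50/50:

from sklearn.model_selection import KFold

# Plain (non-stratified) k-fold on the already balanced data; the y argument
# is ignored by KFold.split, so the folds are balanced only approximately
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train, test in kf.split(X_ros, y_ros):
    print('train - {} | test - {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))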