python scikit-learn dataset cross-validation

How to correctly split unbalanced dataset incorporating train test and cross validation set

The picture above is what I'm trying to replicate. I just don't know if I'm going about it the right way. I'm working with the FakeNewsChallenge dataset and its extremely unbalanced, and I'm trying to replicate and improve on a method used in a paper.

Agree - 7.36%

Disagree - 1.68%

Discuss - 17.82%

Unrelated - 73.13%

I'm splitting the data in this way:

(split dataset into 67/33 split)

train 67%, test 33%

(split training further 80/20 for validation)

training 80%, validation 20%

(Then split training and validation using 3 fold cross validation set)

As an aside, getting that 1.68% of disagree and agree has been extremely difficult.

This is where I'm having an issue as it's not making total sense to me. Is the validation set created in the 80/20 split being stratified as well in the 5fold?

Here is where I am at currently:

Split data into 67% Training Set and 33% Test Set

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33)

x_train2, x_val, y_train2, y_val = train_test_split(x_train1, y_train1, test_size=0.20)

skf = StratifiedKFold(n_splits=3, shuffle = True)
skf.getn_splits(x_train2, y_train2)

for train_index, test_index in skf.split(x_train2, y_train2):
  x_train_cros, x_test_cros = x_train2[train_index], x_train2[test_index]
  y_train_cros, y_test_cros = y_train2[train_index], y_train[test_index]

Would I run skf again for the validation set as well? Where are the test test sets from skf created being used in sequential model?

Citation for the method I'm using:

Thota, Aswini; Tilak, Priyanka; Ahluwalia, Simrat; and Lohia, Nibrat (2018) "Fake News Detection: A Deep Learning Approach," SMU Data Science Review: Vol. 1 : No. 3 , Article 10. Available at: https://scholar.smu.edu/datasciencereview/vol1/iss3/10

Solution

You need to add one more parameter in the function 'train_test_split()':

x_train1, x_test, y_train1, y_test = train_test_split(x, y, test_size=0.33, stratify = y)

This will give you equal distribution of all target categories.