python numpy machine-learning scikit-learn sklearn-pandas

Create a training/validation split using sklearn

I have a training set consisting of X and Y. X has shape (4000, 32, 1) and Y has shape (4000, 1).

I would like to split it into a training set and a validation set. Here is what I have tried:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]

Running the program gives the following error for the code segment above:

for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I do not understand this error message. What is the right way to create a training/validation split for a training set like the one above?

Solution

That is a little strange, because I copied and pasted your code and ran it on sklearn's breast cancer dataset as follows:

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]

Here X.shape = (569, 30) and Y.shape = (569,), and I got no error. For example, y_valid.shape = (57,), which is about one tenth of 569.
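One detail worth noting about the loop above: StratifiedShuffleSplit defaults to n_splits=10, so the loop runs ten times and only the last pair of index arrays is kept. A minimal sketch of drawing a single split instead, assuming the same breast cancer data (the use of n_splits=1 and next() is my own choice here, not something from the question):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedShuffleSplit

# Same dataset as above, loaded directly as arrays.
X, Y = load_breast_cancer(return_X_y=True)

# Ask for exactly one split and take it from the generator with next().
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=23)
train_index, valid_index = next(sss.split(X, Y))

X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]

print(X_train.shape, X_valid.shape)
```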

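I suggest reshaping X to (4000, 32) and Y to (4000,), since the extra trailing dimension may be confusing the splitter (I am using Python 2.7, by the way). Note also what the error message itself says: at least one class in Y has only one sample, and a stratified split needs at least two samples of every class, so it is worth checking how often each label occurs in Y.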
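A quick way to check both points is to flatten the arrays and count how many samples each class has. The sketch below uses made-up stand-in data with the shapes from the question (the random values and the single label 99 are assumptions for illustration, not the asker's real data):

```python
import numpy as np

# Hypothetical stand-in for the asker's data: 4000 samples with 3 common labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 32, 1))
Y = rng.integers(0, 3, size=(4000, 1))
Y[0, 0] = 99  # one label that occurs only once, the situation the error describes

# Drop the trailing dimension of size 1.
X_flat = X.reshape(len(X), -1)  # shape (4000, 32)
y_flat = Y.ravel()              # shape (4000,)

# Count the members of each class; a stratified split needs at least 2 per class.
classes, counts = np.unique(y_flat, return_counts=True)
rare_classes = classes[counts < 2]

print(dict(zip(classes.tolist(), counts.tolist())))
print("classes with fewer than 2 samples:", rare_classes.tolist())
```

If rare_classes is not empty, those samples have to be removed, merged into another class, or the split has to be done without stratification. If Y turns out to hold continuous values rather than class labels, nearly every value is its own "class", and a plain shuffle split is the better fit.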

To answer your question, you can alternatively use train_test_split

from sklearn.model_selection import train_test_split

which, according to its help text, does the following:

Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))``

It is basically a wrapper around what you were trying to do. You can specify the training and test sizes, the random_state, whether to stratify the data, whether to shuffle it, and so on.

It is easy to use, for example:

X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0)
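Note that the call above is a plain random split. If you want the class proportions preserved, as StratifiedShuffleSplit does, pass the labels through the stratify parameter. A minimal sketch on synthetic data (the random arrays and the class probabilities are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4000 samples, 3 imbalanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 32))
y = rng.choice([0, 1, 2], size=4000, p=[0.7, 0.2, 0.1])

# stratify=y keeps the class proportions roughly equal in both subsets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

train_props = np.bincount(y_train) / len(y_train)
valid_props = np.bincount(y_valid) / len(y_valid)

print(X_train.shape, X_valid.shape)
print(train_props)
print(valid_props)
```

With stratify set, this raises the same kind of ValueError as in the question if any class has fewer than two samples, so the class counts still need to be checked first.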