I have a training set consisting of X and Y, The X is of shape (4000,32,1) and Y is of shape (4000,1).
I would like to create a training/validation set based on split. Here is what I have been trying to do
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error message related to the above code segment
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not very clear about the above error message, what's the right way to create a training/validation split for the training set as above?
It's a little bit weird because I copy/pasted your code with sklearn's breast cancer dataset as follow
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30)
and Y.shape = (569,)
and I had no error, for example y_valid.shape = 57
or one tenth of 569.
I suggest you to reshape X into (4000,32) (and so Y into (4000)), because Python may see it as a list of ONE big element (I am using python 2-7 by the way).
To answer your question, you can alternatively use train_test_split
from sklearn.model_selection import train_test_split
which according to the help
Split arrays or matrices into random train and test subsets Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))`
Basically a wrapper of what you wanted to do. You can then specify the training and the test sizes, the random_state, if you want to stratify your data or to shuffle it etc.
It's easy to use for example:
X_train, X_valid, y_train, y_valid = train_test_split(X,Y, test_size = 0.1, random_state=0)