
Create a training/validation split using sklearn


I have a training set consisting of X and Y. X has shape (4000,32,1) and Y has shape (4000,1).

I would like to split it into a training set and a validation set. Here is what I have been trying:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]

Running the program gives the following error message for the code segment above:

for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I do not understand this error message. What is the right way to create a training/validation split for a training set like the one above?


Solution

  • This is a little odd, because I copied and pasted your code and ran it on sklearn's breast cancer dataset, as follows:

    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    X, Y = cancer.data, cancer.target
    
    from sklearn.model_selection import StratifiedShuffleSplit
    sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
    for train_index, valid_index in sss.split(X, Y):
        X_train, X_valid = X[train_index], X[valid_index]
        y_train, y_valid = Y[train_index], Y[valid_index]
    

    Here X.shape = (569, 30) and Y.shape = (569,), and I got no error. For example, y_valid.shape = (57,), which is one tenth of 569, rounded up.

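    I suggest reshaping X to (4000, 32) and Y to (4000,), because the extra trailing dimension may cause each sample to be treated as one big element (I am using Python 2.7, by the way). Note also that the error message itself says that at least one class in Y occurs only once, so it is worth checking the class counts in Y.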
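
    As a rough sketch of the checks suggested above: the snippet below builds hypothetical stand-in data with the shapes from the question (the real X and Y are not shown, so the random values and the single-member class 99 are assumptions made only to reproduce the error), flattens the trailing dimensions, and counts the members of each class. In this toy setup the error persists after reshaping, because it is driven by the class counts; dropping the rare class is just one possible workaround, not necessarily the right one for the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in data with the shapes from the question.
rng = np.random.RandomState(0)
X = rng.rand(4000, 32, 1)
Y = rng.randint(0, 3, size=(4000, 1))
Y[0, 0] = 99  # assumed: one class that occurs exactly once

# Drop the trailing singleton dimensions.
X2 = X.reshape(len(X), -1)  # shape (4000, 32)
y = Y.ravel()               # shape (4000,)

# Count how many samples each class has.
classes, counts = np.unique(y, return_counts=True)
rare = classes[counts < 2]
print(dict(zip(classes.tolist(), counts.tolist())))
print("classes with fewer than 2 samples:", rare.tolist())

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=23)

# The stratified split still fails while the single-member class is present.
try:
    next(sss.split(X2, y))
    split_failed = False
except ValueError as err:
    split_failed = True
    print(err)

# One possible workaround: leave out the samples of the rare classes.
keep = ~np.isin(y, rare)
train_index, valid_index = next(sss.split(X2[keep], y[keep]))
X_train, X_valid = X2[keep][train_index], X2[keep][valid_index]
y_train, y_valid = y[keep][train_index], y[keep][valid_index]
```

    If the counts show many classes with one member each, Y may hold continuous values rather than class labels, in which case a non-stratified split is probably the better fit.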

    To answer your question, you can alternatively use train_test_split

    from sklearn.model_selection import train_test_split
    

    which, according to its docstring, does the following:

    Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))``

    In other words, it is a wrapper around what you were trying to do. You can specify the training and test sizes, the random_state, whether to stratify the data, whether to shuffle it, and so on.

    It is easy to use, for example:

    X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0)
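
    To illustrate the stratify option mentioned above, here is a minimal sketch that uses the breast cancer dataset from earlier in this answer in place of the asker's data. Passing stratify=Y asks train_test_split to keep the class proportions roughly equal in both subsets. It raises the same ValueError as StratifiedShuffleSplit if a class has only one member, so it does not sidestep that problem.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data: 569 samples, 30 features, binary labels.
X, Y = load_breast_cancer(return_X_y=True)

# Hold out 10% for validation, keeping class proportions similar.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, Y, test_size=0.1, random_state=0, stratify=Y
)

print(X_train.shape, X_valid.shape)
# Class fractions in the full set and in the validation set.
print(np.bincount(Y) / len(Y))
print(np.bincount(y_valid) / len(y_valid))
```

    Leaving out stratify gives a plain random split, which has no minimum class-count requirement.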