python pandas scikit-learn data-science sklearn-pandas

What does this error mean with StratifiedShuffleSplit?


I'm totally new to Data Science in general and was hoping someone could explain why this does not work:

I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.

Next, I try to use sklearn's StratifiedShuffleSplit class to divide the data into training and testing sets.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]): # Generate indices to split data into training and test set.
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.


Solution

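  • I think the problem is that your data_y is a 2D matrix.

    As far as I can see in the sklearn.model_selection.StratifiedShuffleSplit docs, y should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class), and then run the split.

    Or, possibly, your y is a regression variable (continuous numerical data) (see Vivek's link). Stratification treats every distinct value of y as its own class, so a continuous target such as sales ends up with many classes that have only one member, which is exactly what the error reports.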
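
If the target really is continuous, one common workaround is to stratify on a binned copy of it instead of on the raw values. Below is a minimal sketch of that idea, not a definitive fix. It assumes a synthetic stand-in for the Advertising data (the numbers are made up so the snippet runs without downloading anything), and both the choice of pd.qcut with 5 bins and the name sales_bin are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Advertising data (made-up values, same column names).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["sales"] = 3 + 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 200)

# Diagnosis: count how often each distinct target value occurs.
# For a continuous target, most values occur only once.
smallest_class = data["sales"].value_counts().min()
print("Smallest 'class' size in raw sales:", smallest_class)

# Workaround: bin the target into quantile-based groups (5 is an arbitrary choice)
# and pass the bin labels, not the raw values, as y.
sales_bin = pd.qcut(data["sales"], q=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(data.drop("sales", axis=1), sales_bin):
    # split() yields positional indices, so iloc is the safe accessor here.
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

print(len(strat_train_set), len(strat_test_set))
```

The bins are used only to balance the split; the training and test sets still contain the original sales values. If stratification is not needed at all for a regression target, a plain ShuffleSplit or train_test_split from sklearn.model_selection avoids the issue entirely.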