Tags: python, pandas, scikit-learn, jupyter, sklearn-pandas

"The least populated class in y has only 1 ... groups for any class cannot be less than 2." Without train_test_split()


I am trying to run this code on a dataset relating Corona cases to Corona deaths. I have not found any reason why the error should arise from the way I split the data into X and y DataFrames, but I do not fully understand the error either.

Does someone know what is wrong here?

import numpy as np
import pandas as pd
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing


#import csv
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")

data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]

#prepare for fit (transform Location strings into classes)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)

LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))

X_train["Location"] = X_train["Location"].replace(LocationDict)


#train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

Traceback:

Input In [89], in <cell line: 29>()
     27 #train and run
     28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    343 # Save the state of the RNG for the training and validation split.
    344 # This is needed in order to have the same split when using
    345 # warm starting.
    347 if sample_weight is None:
--> 348     X_train, X_val, y_train, y_val = train_test_split(
    349         X,
    350         y,
    351         test_size=self.validation_fraction,
    352         stratify=stratify,
    353         random_state=self._random_seed,
    354     )
    355     sample_weight_train = sample_weight_val = None
    356 else:
    357     # TODO: incorporate sample_weight in sampling here, as well as
    358     # stratify

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2450         CVClass = ShuffleSplit
   2452     cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454     train, test = next(cv.split(X=arrays[0], y=stratify))
   2456 return list(
   2457     chain.from_iterable(
   2458         (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
   2459     )
   2460 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
   1583 """Generate indices to split data into training and test set.
   1584 
   1585 Parameters
   (...)
   1610 to an integer.
   1611 """
   1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
   1614     yield train, test

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   1951 class_counts = np.bincount(y_indices)
   1952 if np.min(class_counts) < 2:
-> 1953     raise ValueError(
   1954         "The least populated class in y has only 1"
   1955         " member, which is too few. The minimum"
   1956         " number of groups for any class cannot"
   1957         " be less than 2."
   1958     )
   1960 if n_train < n_classes:
   1961     raise ValueError(
   1962         "The train_size = %d should be greater or "
   1963         "equal to the number of classes = %d" % (n_train, n_classes)
   1964     )

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.



Solution

  • The HistGradientBoostingClassifier internally splits your dataset into training and validation sets. The default is 10% for validation (see the validation_fraction parameter in the docs).

    In your case, there is a class with only a single example in it, so if that example goes into the train split, the classifier cannot validate on that class, and vice versa. The point is: you need at least two examples of each class.

    How to solve it? Well, first you need a proper diagnosis: run the following code to see which class is the problem:

    import numpy as np
    
    unq, cnt = np.unique(y_train, return_counts=True)
    
    for u, c in zip(unq, cnt):
        print(f"class {u} contains {c} examples")
    

    What to do now? Well, first make sure those results make sense to you and that there is no earlier error (maybe reading your CSV incorrectly, or losing data in a previous step).

    Then, if the problem persists, your options are the following:

    • Collect more data. Not always possible, but this is the best option.

    • Add synthetic data. imblearn, for instance, is a scikit-learn-style library for imbalanced problems like yours; it provides several well-known oversampling methods. You can also create your own synthetic data, since you know your domain.

    • Remove classes with a single example. This means re-framing your problem a little, but it may work: just drop the row. You can also re-label it as one of the closest classes; for instance, if you have the classes positive, negative, and neutral, and a single example of the neutral class, you could re-label it as negative.

    • Group classes with low cardinality. This is useful when you have many classes, say 10, and some of them, say 3, have very few examples. You can merge those low-cardinality classes into a single class "other", turning your problem into a similar one with fewer but more populated classes: in the example, 8 classes instead of 10.
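    The last two options can be sketched with plain pandas. This is a minimal, illustrative example on toy labels (the class names and the cardinality thresholds are assumptions, not from your data):

    ```python
    import pandas as pd

    # Toy labels: classes "c" and "d" each have a single example
    y_train = pd.Series(["a"] * 4 + ["b"] * 3 + ["c", "d"])

    counts = y_train.value_counts()

    # Option 1: drop rows whose class has fewer than 2 examples
    keep_mask = y_train.map(counts) >= 2
    y_dropped = y_train[keep_mask]
    # (apply the same mask to X_train: X_train = X_train[keep_mask])

    # Option 2: merge classes with fewer than 2 examples into "other"
    rare = counts[counts < 2].index
    y_grouped = y_train.replace(dict.fromkeys(rare, "other"))

    print(y_dropped.value_counts().to_dict())  # {'a': 4, 'b': 3}
    print(y_grouped.value_counts().to_dict())  # {'a': 4, 'b': 3, 'other': 2}
    ```

    Note that after merging, the combined "other" class has two examples, which is enough for the stratified validation split.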

    What is the best alternative? It really depends on your problem.

    EDIT: The answer above assumes you are solving a classification problem (predicting which class an example belongs to). If you are solving a regression task (predicting a quantity), replace your HistGradientBoostingClassifier with a HistGradientBoostingRegressor.
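    Since the target here ("Next Week's Deaths") looks like a quantity, the regression variant may be what you want. A minimal sketch on synthetic data (the array shapes and coefficients are made up for illustration):

    ```python
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.01, 200)  # continuous target

    # The regressor does not stratify its internal validation split,
    # so the "least populated class" error cannot occur here.
    model = HistGradientBoostingRegressor(max_bins=255, max_iter=100)
    model.fit(X, y)
    pred = model.predict(X[:5])
    print(pred.shape)  # (5,)
    ```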