Tags: python, machine-learning, cross-validation, sklearn-pandas

sklearn cross-validation: The least populated class in y has only 1 members, which is less than n_splits=10


I'm working on a machine learning project and I'm stuck on a warning that appears when I use cross-validation to find how many neighbors give the best accuracy with KNN. Here's the warning:

The least populated class in y has only 1 members, which is less than n_splits=10.

The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance

The dataset has several attributes, but I'm only using "G1", "G2", "G3", "studytime", "freetime", "health" and "famrel". All the values in those columns are integers. Dataset example: https://i.sstatic.net/sirSl.png

Next, here's my first chunk of code, where I build the train and test sets:

import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.model_selection import train_test_split

# Mount Google Drive so the dataset file is readable from Colab
drive.mount('/gdrive')

data = pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")

#print(data.head())
# Keep only the columns used in this experiment
data = data[["G1", "G2", "G3", "studytime", "freetime", "health", "famrel"]]
print(data)
predict = "G3"  # target column: final grade

x = np.array(data.drop(columns=[predict]))  # features
y = np.array(data[predict])                 # target
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))

That's how I assign x and y. With len, I can see that x and y both have 649 rows, representing 649 students.

Here's the second chunk of code, where I do the cross-validation:

#CROSSVALIDATION
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

neighbors = list(range(2, 30))  # candidate values of k
cv_scores = []
#print(y_train)

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11, scoring='accuracy')
    cv_scores.append(scores.mean())  # mean accuracy across folds for this k

plt.plot(neighbors, cv_scores)  # x-axis is k, not the list index
plt.xlabel("n_neighbors")
plt.ylabel("mean CV accuracy")
plt.show()

The code is pretty self-explanatory, as you can tell.

The warning:

The least populated class in y has only 1 members, which is less than n_splits=10.

happens in every iteration of the for-loop

Although this warning appears every time, plt.show() still plots a graph showing which number of neighbors gives the best accuracy. I don't know whether the plot, or the values in cv_scores, are accurate.

My question is:

How can my "class in y" have only 1 member? len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each. Does "members" refer to instances in my dataset, or to columns/labels in the y group?

I'm not using stratify=y when I do the train/test split. It seems to be the number one solution to this warning, but it's useless in my case.

I've tried everything I've seen on Google and Stack Overflow and nothing helped. The dataset seems to be the problem, but I can't understand what's wrong.


Solution

  • I think your main mistake is that you are using KNeighborsClassifier, while the target you want to predict seems to be continuous (G3 - final grade (numeric: from 0 to 20, output target)) rather than categorical.

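    In this case, every distinct value of "y" is treated as a separate class or label. The message is saying that some values of "y" appear only once in the data passed to cross-validation, so "members" means the instances that carry a given label, not the total number of rows. For example, the value 3 appears only once in your dataset. For a classifier, cross_val_score with an integer cv uses stratified folds, which try to keep every class represented in each fold, and a class with a single member cannot be spread over 10 folds. This is not an error, but it does indicate that the model won't work correctly or accurately.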

    So I strongly recommend using sklearn.neighbors.KNeighborsRegressor instead.

    This is the KNN variant for "continuous" targets rather than classes. With this model you shouldn't get this warning anymore. By default, the predicted value is the mean of the target values of the k nearest neighbors you defined.

    With these simple changes, your problem should be solved.
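
A minimal sketch of both points, under some loud assumptions: the data below is synthetic (six integer feature columns and an integer grade, not the real student file), and the one outlier row with grade 1 is added on purpose so that one "class" has a single member. The first part counts how many instances carry each label, which is what the warning is about. The second part swaps in KNeighborsRegressor. Since 'accuracy' is a classification metric, the sketch scores with 'neg_mean_absolute_error' instead; that choice is mine, not something from the question.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

n_splits = 10

# Synthetic stand-in for x_train / y_train (an assumption, not the real dataset).
rng = np.random.default_rng(0)
g1 = rng.integers(8, 16, size=199)            # first-period grade, 8..15
g2 = rng.integers(8, 16, size=199)            # second-period grade, 8..15
other = rng.integers(1, 6, size=(199, 4))     # four 1..5 columns (studytime, etc.)
noise = rng.integers(-1, 2, size=199)         # -1, 0 or 1
x_train = np.column_stack([g1, g2, other])
y_train = (g1 + g2) // 2 + noise              # final grade, 7..16

# One outlier student whose final grade (1) occurs nowhere else.
x_train = np.vstack([x_train, [1, 1, 1, 1, 1, 1]])
y_train = np.append(y_train, 1)

# Part 1: "members" = number of instances per distinct label in y.
class_counts = Counter(y_train.tolist())
rare = {label: n for label, n in class_counts.items() if n < n_splits}
print("labels with fewer than", n_splits, "members:", rare)

# Part 2: treat the grade as a number and cross-validate a KNN regressor.
neighbors = list(range(2, 30))
cv_scores = []
for k in neighbors:
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(
        knn, x_train, y_train, cv=n_splits, scoring="neg_mean_absolute_error"
    )
    cv_scores.append(-scores.mean())          # flip sign: mean absolute error

best_k = neighbors[int(np.argmin(cv_scores))]  # lowest error is best
print("k with lowest mean absolute error:", best_k)
```

With a regressor, an integer cv uses plain (unstratified) folds, so rare target values no longer matter for the split. On the real data you would pass your own x_train and y_train and pick the k with the lowest error rather than the highest accuracy.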