python, machine-learning, scikit-learn, adaboost

AdaBoostClassifier: Perfect Metrics with test_size=0.25, but Inconsistent Samples Error for Other Values


I'm using AdaBoostClassifier with a weak learner (DecisionTreeClassifier) to classify a dataset. The dataset has 7857 samples:

X.shape
# Output: (7857, 5)

y.shape
# Output: (7857,)

Here’s the code for splitting the dataset and training the model:

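from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(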
    X, y, test_size=0.25, random_state=28
)

weak_learner = DecisionTreeClassifier(max_depth=1)

adb = AdaBoostClassifier(estimator=weak_learner, n_estimators=50, random_state=42)
adb_model = adb.fit(X_train, y_train)

y_pred = adb_model.predict(X_test)
print(classification_report(y_test, y_pred))

When I run this code with test_size=0.25, the output for the classification metrics is 100% for all categories:

              precision    recall  f1-score   support

       Cheap       1.00      1.00      1.00       496
   Expensive       1.00      1.00      1.00       506
  Reasonable       1.00      1.00      1.00       963

    accuracy                           1.00      1965
   macro avg       1.00      1.00      1.00      1965
weighted avg       1.00      1.00      1.00      1965

This cannot be right: my data points are not perfectly separable (I checked with a plot).

However, when I change the test_size to any other value (e.g., 0.3, 0.2), I get the following error:

ValueError: Found input variables with inconsistent numbers of samples

What I've Checked:

  1. Ensured that X and y have the same number of samples.
  2. Confirmed there are no missing values in X or y.

Questions:

  1. Why does test_size=0.25 produce perfect metrics, but other test_size values result in an error?
  2. How can I fix this issue to use different test_size values?

Solution

  • The test_size value itself probably has little effect on the metrics. The perfect scores more likely mean that the label follows a very simple function of the features, so AdaBoost needs nothing more than DecisionTreeClassifier(max_depth=1) stumps to learn it.

    However, you are probably reusing the same variable name y_pred throughout your code, so y_pred has to be recomputed whenever you change the model or the test set. test_size controls the length of the test set, and the earlier y_pred was computed on a test set holding 25% of the data. If you change test_size, X_test and y_test change with it, and you need to recompute y_pred with adb.predict(X_test), both so that the predictions correspond to the new data points and so that the lengths match and no inconsistent-samples error is raised.

    To fix the issue, add the following line before calling any function that uses y_test and y_pred:

    y_pred = adb.predict(X_test)
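
As a rough illustration of the stale y_pred problem, here is a minimal sketch. It assumes synthetic data from make_classification as a stand-in for the asker's dataset, and split_fit_report is a hypothetical helper name, not something from the question. It relies on the default base estimator of AdaBoostClassifier, which is a depth-1 decision tree, so it should behave like the weak learner in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def split_fit_report(X, y, test_size):
    """Split, fit, and predict in one pass so y_pred never goes stale."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=28
    )
    # Default base estimator: a depth-1 decision tree (a stump).
    adb = AdaBoostClassifier(n_estimators=50, random_state=42)
    adb.fit(X_train, y_train)
    y_pred = adb.predict(X_test)  # recomputed for this exact split
    return y_test, y_pred


# Synthetic stand-in data (an assumption, not the asker's dataset).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)

y_test_25, y_pred_25 = split_fit_report(X, y, test_size=0.25)
y_test_30, y_pred_30 = split_fit_report(X, y, test_size=0.30)

# Labels and predictions from the same split always have equal length.
print(classification_report(y_test_30, y_pred_30))

# Labels from the new split combined with predictions from the old split
# reproduce the error from the question.
try:
    classification_report(y_test_30, y_pred_25)
except ValueError as err:
    print("Stale y_pred:", err)
```

Keeping the split, the fit, and the prediction in one function (or one notebook cell) means they always rerun together, which is one way to avoid comparing a new y_test against an old y_pred.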
    
