
Do learning curves show overfitting?


I'm trying to determine whether my binary classification model suffers from overfitting, and I have obtained its learning curve. The dataset has 6836 instances, 1006 of which belong to the positive class.

1) If I use SMOTE to balance the classes and RandomForest as the classifier, I obtain this curve, with these ratios: TPR = 0.887 and FPR = 0.041:

Learning curve 1

Note that the training error is flat and almost 0.

2) If I use the function "balanced_subsample" (attached at the end) to balance the classes and RandomForest as the classifier, I obtain this curve, with these ratios: TPR = 0.866 and FPR = 0.14:

Learning curve 2

Note that in this case the test error is flat.

  • Do the models suffer from overfitting?
  • Which of them makes more sense?

The function "balanced_subsample":

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Collect the rows of x belonging to each class and track the
    # size of the smallest class
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Optionally use only a fraction of the smallest class size
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # Take use_elems rows from each class (shuffled first, so the
    # subsample is random when a class is larger than use_elems)
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
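
A quick usage sketch on made-up data (the arrays below are synthetic, purely to show the shapes; subsample_size=1.0 keeps every minority-class row):

import numpy as np

# Made-up imbalanced data: 90 negatives, 10 positives, 3 features
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = balanced_subsample(X_demo, y_demo, subsample_size=1.0)
print(X_bal.shape)                     # (20, 3): 10 rows per class
print(np.bincount(y_bal.astype(int)))  # [10 10]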

EDIT 1: More info about the code and the process

X = data
y = X.pop('myclass')

# There are categorical and numerical attributes in my data set,
# so here I vectorize the categorical ones
arrX = vectorize_attributes(X)

# Balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX, y)
#X_train_balanced, y_train_balanced = balanced_subsample(arrX, y)

# Train/test split (stratified k-fold happens later, inside the grid search)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)

# Estimator (np.random.seed() returns None, so pass an int for a reproducible forest)
clf = RandomForestClassifier(random_state=0)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}

# Grid search, optimizing F1
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

# Fit & prediction
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
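
For completeness, since the plotting code isn't shown above: a minimal sketch of how such a learning curve can be produced with scikit-learn's learning_curve (the train_sizes and styling below are assumptions, not the original code):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Compute train/test scores for increasing training-set sizes;
# X_train_balanced / y_train_balanced are the balanced arrays from above
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X_train_balanced, y_train_balanced,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

# Convert accuracy scores to error rates and plot their means
train_err = 1.0 - train_scores.mean(axis=1)
test_err = 1.0 - test_scores.mean(axis=1)
plt.plot(train_sizes, train_err, 'o-', label='training error')
plt.plot(train_sizes, test_err, 'o-', label='test error')
plt.xlabel('training examples')
plt.ylabel('error')
plt.legend()
plt.show()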

EDIT 2: In this case I tried a Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization

X = data
y = X.pop('myclass')

# There are categorical and numerical attributes in my data set,
# so here I vectorize the categorical ones
arrX = vectorize_attributes(X)

# FOR SCENARIO 3: normalization
normalized_X = preprocessing.normalize(arrX)

# FOR SCENARIOS 2 and 3: keep only the k highest-scoring features
# (for scenario 2, fit on arrX instead; note chi2 requires non-negative features)
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(normalized_X, y)

# Balance the classes using SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX_features_selected, y)
#X_train_balanced, y_train_balanced = balanced_subsample(arrX_features_selected, y)

# Train/test split (stratified k-fold happens later, inside the grid search)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)

# Estimator (GradientBoostingClassifier, to match the three GBC scenarios above)
clf = GradientBoostingClassifier(random_state=0)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}

# Grid search, optimizing F1
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)

# Fit & prediction
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
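
One aside on the SelectKBest(chi2, ...) step: chi2 only accepts non-negative feature values, so if the vectorized matrix contains negatives it will raise an error. A minimal self-contained sketch (all data below is made up) that pairs it with a [0, 1] scaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 6 samples, 4 features (one with negative values), binary target
X_toy = np.array([[1.0, -2.0, 3.0, 0.5],
                  [2.0, -1.0, 0.0, 0.4],
                  [0.5, -3.0, 1.0, 0.6],
                  [4.0, -0.5, 2.0, 0.1],
                  [3.0, -2.5, 0.5, 0.2],
                  [1.5, -1.5, 2.5, 0.3]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

# chi2 rejects negative values, so rescale every feature to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_toy)

# Keep the 2 features with the highest chi2 score against the target
X_selected = SelectKBest(chi2, k=2).fit_transform(X_scaled, y_toy)
print(X_selected.shape)  # (6, 2)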

The learning curves of the 3 proposed scenarios are:

SCENARIO 1: GBC + SMOTE (learning curve image)

SCENARIO 2: GBC + SMOTE + feature selection (learning curve image)

SCENARIO 3: GBC + SMOTE + feature selection + normalization (learning curve image)


Solution

  • So, your first curve makes sense. You expect test error to come down as you increase the number of training points, and you expect uniformly near-0 train error when you have a random forest of trees with no maximum depth and 100% max samples. You probably are overfit, but you probably aren't going to get much better with RandomForests (or, depending on the data set, with anything else).

    Your second curve does not make sense. You should again get near-0 train error, unless something totally wonky is going on (like a really broken input set). I can't see anything wrong with your code, and I ran your function; it seems to work fine. Short of you posting a complete working example with code, there is nothing more I can do.
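
    If you want to probe the overfitting directly, one option is to constrain the forest and watch the train/CV gap; a minimal sketch (the max_depth values are illustrative, not tuned, and X_train_balanced / y_train_balanced stand in for your balanced data):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Compare an unconstrained forest with depth-limited ones
    for depth in [None, 10, 5, 3]:
        rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
        rf.fit(X_train_balanced, y_train_balanced)
        train_acc = rf.score(X_train_balanced, y_train_balanced)
        cv_acc = cross_val_score(rf, X_train_balanced, y_train_balanced, cv=10).mean()
        # A large train/CV gap at depth=None that shrinks as depth drops
        # is the overfitting signature the curves above are probing
        print(depth, round(1 - train_acc, 3), round(1 - cv_acc, 3))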