I am performing URL classification (phishing vs. non-phishing) and I plotted the learning curves (training vs. cross-validation score) for my model (Gradient Boosting).
My View
It seems that these two curves converge and the difference is not significant (it's normal for the training set to have slightly higher accuracy). (Figure 1)
The Question
I have limited experience in machine learning, so I am asking for your opinion. Is the way I am approaching the problem right? Is this model fine, or is it overfitting?
Note: The classes are balanced and the features are well chosen.
Relevant code
from yellowbrick.model_selection import LearningCurve
from sklearn.model_selection import StratifiedKFold
import numpy as np

def plot_learning_curves(X, y, model):
    # Create the learning curve visualizer
    cv = StratifiedKFold(n_splits=5)
    sizes = np.linspace(0.1, 1.0, 8)
    visualizer = LearningCurve(model, cv=cv, train_sizes=sizes, n_jobs=4)
    visualizer.fit(X, y)  # Fit the data to the visualizer
    visualizer.poof()     # renamed to show() in newer yellowbrick versions
Firstly, your graph shows 8 different models (one per training-set size).
It's hard to tell if any one of them is overfitting, because overfitting is detected with an "epoch vs. performance (train / validation)" graph (there would be 8 of them in your case).
Overfitting means that, after a certain number of epochs, as the number of epochs increases, training accuracy keeps going up while validation accuracy goes down. This can happen, for example, when you have too few data points for the complexity of your problem, so your model latches onto spurious correlations.
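To make this concrete: for gradient boosting, the analogue of "epochs" is boosting iterations, and scikit-learn's `staged_predict` gives you the prediction after each iteration. A minimal sketch (using synthetic stand-in data, not your dataset) of how to compute the per-iteration train/validation curve that would actually reveal overfitting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; replace with your URL features and labels.
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr)

# Accuracy after each boosting iteration, on train and validation sets.
train_acc = [accuracy_score(y_tr, p) for p in model.staged_predict(X_tr)]
val_acc = [accuracy_score(y_val, p) for p in model.staged_predict(X_val)]

# Plot train_acc and val_acc against the iteration number: a point where
# val_acc turns downward while train_acc keeps rising is the signature
# of overfitting described above.
```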
With your graph, what we can say is that the complexity of your problem seems to require a "high" number of training instances, because your validation performance keeps increasing as you add more training instances. There is a chance that the model trained with <10000 instances is overfitting, but your >50000 model could be overfitting too, and we wouldn't see it because you are using early stopping!
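If you want to check whether early stopping is actually masking overfitting, one option (a sketch, assuming scikit-learn's `GradientBoostingClassifier` and synthetic stand-in data) is to enable its built-in early stopping and inspect the fitted `n_estimators_` attribute, which reports how many boosting iterations actually ran before stopping:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data; replace with your URL features and labels.
X, y = make_classification(n_samples=2000, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting iterations
    n_iter_no_change=10,      # stop if the validation score stalls for 10 rounds
    validation_fraction=0.1,  # fraction held out internally for early stopping
    random_state=0,
)
model.fit(X, y)

# If this is well below 500, early stopping kicked in: the validation score
# had stopped improving, which is exactly the regime where the later
# iterations would have started to overfit.
print(model.n_estimators_)
```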