
Learning curves - Why does the training accuracy start so high, then suddenly drop?


I implemented a model that uses Logistic Regression as the classifier, and I wanted to plot the learning curves for both the training and test sets to decide what to do next to improve the model.

Just to give you some background: to plot the learning curves I defined a function that takes a model, a pre-split dataset (train/test X and Y arrays, NB: split with the train_test_split function), and a scoring function as input, trains the model on n exponentially spaced subsets of the training set, and returns the learning curves.
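For reference, here is a minimal sketch of the kind of function described above (the function name, argument order, and subset spacing are my own assumptions, not the original code):

    import numpy as np
    from sklearn.base import clone

    def learning_curves(model, X_train, y_train, X_test, y_test, scorer, n_subsets=10):
        # Exponentially spaced subset sizes, from ~10 samples up to the full training set
        sizes = np.unique(np.logspace(1, np.log10(len(X_train)), n_subsets).astype(int))
        train_scores, test_scores = [], []
        for m in sizes:
            clf = clone(model)                 # fresh, unfitted copy for each subset
            clf.fit(X_train[:m], y_train[:m])  # train on the first m samples only
            train_scores.append(scorer(y_train[:m], clf.predict(X_train[:m])))
            test_scores.append(scorer(y_test, clf.predict(X_test)))
        return sizes, train_scores, test_scores

Called, for example, as learning_curves(LogisticRegression(), X_train, y_train, X_test, y_test, accuracy_score).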

My results are in the image below: [learning curves of training and test accuracy vs. training set size]

I wonder why the training accuracy starts so high, then suddenly drops, and then rises again as the training set size increases (and conversely for the test accuracy). My guess was that the extremely good accuracy and the subsequent fall were caused by noise from the small subsets at the beginning, and that the accuracy then started to rise as the subsets became more representative, but I am not sure. Can someone explain this?

And finally, can we assume that these results indicate low variance / moderate bias (70% accuracy in my context is not that bad), and therefore that to improve my model I should resort to ensemble methods or heavy feature engineering?


Solution

  • I think what happens is that you are overfitting the training samples when the dataset is small (very high training accuracy, low test accuracy). As the dataset grows, your classifier starts to generalize better, which raises the accuracy on the test set.

    After about 10^3 training samples, the accuracy seems to level off at roughly 70%, which suggests you have reached a good balance between overfitting the training set and underfitting. A quick way to double-check this diagnosis with smoother curves is sketched below.
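For comparison (this is not the asker's code), scikit-learn's built-in learning_curve averages each point over cross-validation folds, which smooths out the noisy small-sample spikes; X and y below are assumed to be the full feature matrix and labels:

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.linear_model import LogisticRegression

    train_sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000),   # max_iter raised to ensure convergence
        X, y,                                # assumed: full dataset, not pre-split
        train_sizes=np.logspace(-2, 0, 10),  # fractions of the training set, 1%..100%
        cv=5,
        scoring="accuracy",
    )
    print(train_scores.mean(axis=1))  # mean training accuracy per subset size
    print(val_scores.mean(axis=1))    # mean validation accuracy per subset size

If the cross-validated curves also converge around 70%, that supports the overfitting-then-generalizing reading above.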